ABSTRACT
INTRODUCTION
On January 2, 1997, the U.S. National Institute of Standards and Technology (NIST) announced a competition to select a new encryption algorithm to replace the aging and increasingly vulnerable Data Encryption Standard (DES). The algorithm chosen from the candidates as the new Advanced Encryption Standard (AES) was Rijndael [1, 2]. Since becoming the AES, Rijndael has been the focus of countless analyses and has been implemented in both hardware and software on many different platforms. To reduce the AES computation time, parallel computing has been incorporated [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19].
In this paper, a design of parallel AES on a multiprocessor platform is presented. While most previous designs either use pipelined parallelization or take advantage of Mix_Column parallelization, our design combines pipelining of rounds with parallelization of the Mix_Column and Add_Round_Key transformations. The model is divided into two levels: the first pipelines the different rounds, while the second parallelizes both the Add_Round_Key and the Mix_Column transformations. Previous work on pipelining the AES algorithm was based on nine stages, whereas we propose eleven stages in order to exploit the sources of parallelism in both the initial and final rounds. The paper is organized as follows: Section 2 describes the AES algorithm and surveys different designs for its parallel implementation. The proposed design is presented in Section 3. Section 4 gives a performance evaluation of the proposed design. Finally, Section 5 concludes the paper.
RELATED WORK

Advanced Encryption Standard (AES)
The Advanced Encryption Standard (AES) algorithm is a symmetric block cipher that can convert data to an unintelligible form (encryption) and convert the data back into its original form (decryption). Both encryption and decryption operate on sequences of blocks, each consisting of 128 bits. The cipher key for the AES algorithm is a sequence of 128, 192, or 256 bits. Internally, the AES algorithm's operations are performed on a two-dimensional (2-D) array of bytes called the State array. The State array consists of four rows of bytes, each containing $N_b$ bytes, where $N_b$ is the block length divided by 32 (the word size); for the 128-bit AES block, $N_b = 4$.
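As an illustration of the State layout, the following minimal Python sketch arranges a 16-byte block into a 4 x 4 array column by column, which is the input mapping used in FIPS-197. The function names `bytes_to_state` and `state_to_bytes` are hypothetical and chosen only for this example.

```python
def bytes_to_state(block):
    """Arrange a 16-byte block into a 4-row State array, filled column by column."""
    if len(block) != 16:
        raise ValueError("AES operates on 128-bit (16-byte) blocks")
    nb = len(block) * 8 // 32  # block length in bits divided by the word size
    # Element (r, c) of the State is input byte number r + 4*c.
    return [[block[r + 4 * c] for c in range(nb)] for r in range(4)]


def state_to_bytes(state):
    """Inverse mapping: read the State column by column back into a byte string."""
    nb = len(state[0])
    return bytes(state[r][c] for c in range(nb) for r in range(4))


block = bytes(range(16))
state = bytes_to_state(block)
for row in state:
    print(row)
```

Each row of the printed array holds one byte from every 32-bit column, so a column of the State corresponds to four consecutive input bytes.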
Description of the AES Algorithm
The AES algorithm consists of three distinct phases, as shown in Figure 1 [3]:
- In the first phase, an initial addition (XOR) is performed between the input data (plaintext) and the given key (cipher key).
- In the second phase, $N_r - 1$ standard rounds are performed; these represent the kernel of the algorithm and consume most of the execution time. The number of standard rounds depends on the key size: nine for 128 bits, eleven for 192 bits, or thirteen for 256 bits. Each standard round applies four fundamental algebraic transformations to arrays of bytes, namely:
  (1) byte substitution using a substitution table (Sbox);
  (2) shifting the rows of the State array by different offsets (ShiftRow);
  (3) mixing the data within each column of the State array (Mix_Column); and
  (4) adding a round key to the State array (Key-Addition).
- Finally, the third phase is the final round of the algorithm, which is similar to a standard round except that it does not include the Mix_Column operation.
For detailed information on the abovementioned transformations, the reader may refer to [1].
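The three-phase structure can be summarized in a short sketch. The Python fragment below only lists which transformations run in each round for a given key size; it does not implement the transformations themselves. The name `round_schedule` is hypothetical, and the transformation names follow those used later in Section 3.

```python
# Number of rounds Nr for each key size in bits.
ROUNDS_BY_KEY_BITS = {128: 10, 192: 12, 256: 14}


def round_schedule(key_bits):
    """Return the ordered list of transformations applied in each round."""
    nr = ROUNDS_BY_KEY_BITS[key_bits]
    standard = ["Byte_Sub", "Shift_Row", "Mix_Column", "Add_Round_Key"]
    schedule = [["Add_Round_Key"]]                        # phase 1: initial round
    schedule += [list(standard) for _ in range(nr - 1)]   # phase 2: Nr - 1 standard rounds
    schedule.append(["Byte_Sub", "Shift_Row", "Add_Round_Key"])  # phase 3: final round
    return schedule


for number, ops in enumerate(round_schedule(128)):
    print(number, ops)
```

For a 128-bit key the schedule has eleven entries (round 0 to round 10), which is the granularity later used for the pipeline stages.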
The Parallel Advanced Encryption Standard (AES)
The Advanced Encryption Standard (AES) can be deployed in fully hardware [3] [4] [5] [6] [7] [8] [9] [10] [11], hybrid software-hardware [12] [13] [14] [15] [16], and fully software implementations [17] [18] [19], which allows AES to be parallelized in different ways. In the literature, parallelizing Rijndael for hardware implementation has been revisited many times. In [4], Yoo et al. presented a hardware-efficient design that increases AES throughput by means of a high-speed parallel pipelined architecture. They used an efficient inter-round and intra-round pipeline design to achieve high encryption throughput. In each round, there are three pipeline stages: the first follows the byte-sub operation, the second is located after the shift-row operation, and the last is placed before the data output. Moreover, this design has one pipeline stage in the key generation blocks. Hodjat et al. [5] introduced a design that has four or seven pipeline stages, one after the byte-sub operation and three or six within it. In [7], Ananth et al. presented a fully pipelined AES encryption/decryption system that is fully unrolled in order to implement a very deep level of pipelining (i.e., all ten cipher rounds are unrolled). For more hardware implementation designs of AES, the reader may refer to [8] [9] [10] [11].
In software-hardware co-design, on the other hand, some AES transformations are performed by extended special instructions, while the other transformations are performed by general instructions [12] [13] [14] [15] [16]. S. Mahmoud [12] presented a parallel implementation of the AES algorithm using an MPI (Message Passing Interface) based cluster approach. MPI is one of the most established methods in parallel programming, mainly because of the relative simplicity of deploying it as a set of library functions, or an API (Application Program Interface), callable from C, C++, or Fortran programs. In [13], So-In shows that each 16-byte AES block can be encrypted individually. An essential technique of AES parallelism is therefore to assign each thread or node to one AES block, so that together they produce the complete encrypted output in parallel. This technique excludes the key expansion step, which is required before entering the parallel stage. So-In applies AES encryption in ECB mode for the sake of performance evaluation. Similarly, CTR mode can be encrypted without dependency on previous blocks, but other modes cannot.
Other designs that use instruction-set extensions to increase the efficiency of 32-bit processors for the AES encryption algorithm can be found in [14] [15] [16].
In [17], Brisk et al. introduce an example of a fully software implementation of AES. In their work, they derive the asymptotic sequential runtime of the algorithm and describe two parallel implementations. The first is optimal in terms of time, and the second is optimal in terms of cost. In the cost-optimal implementation, they sacrifice speedup in order to reduce the number of processors required for encryption. Other examples of fully software implementations are presented in [18, 19].
In this paper, a design of parallel AES on a multiprocessor platform is presented. While most previous designs either use pipelined parallelization or take advantage of Mix_Column parallelization, we design a parallel model for the AES algorithm that is divided into two levels. The first level pipelines the different rounds, while the second parallelizes both the Add_Round_Key and Mix_Column transformations. In the next section, the proposed parallel AES design is presented.
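The block-level independence described for ECB mode can be sketched as follows. This is only an illustration of the dispatch pattern: `toy_encrypt_block` is a hypothetical placeholder (a plain XOR with the key, not AES and not secure), and Python threads stand in for the threads or cluster nodes mentioned above. In CPython, threads would not speed up this pure-Python arithmetic; the point is only that each block can be processed without reference to any other block.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # bytes per AES block


def toy_encrypt_block(block, key):
    """Placeholder for one AES block encryption (plain XOR, NOT secure)."""
    return bytes(b ^ k for b, k in zip(block, key))


def split_blocks(data):
    """Split data into 16-byte blocks; padding is assumed to be done already."""
    if len(data) % BLOCK_SIZE:
        raise ValueError("data length must be a multiple of 16 bytes")
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def ecb_parallel(data, key, workers=4):
    """Encrypt every block independently; map() returns results in input order."""
    blocks = split_blocks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        encrypted = pool.map(lambda blk: toy_encrypt_block(blk, key), blocks)
        return b"".join(encrypted)


key = bytes(range(16))
data = bytes(64)  # four all-zero blocks
ciphertext = ecb_parallel(data, key)
```

The key (and, in a real implementation, the expanded round keys) is prepared once before the parallel section, which mirrors the remark that key expansion is excluded from the parallel stage.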
THE PROPOSED PARALLEL ADVANCED ENCRYPTION STANDARD (AES) DESIGN
The Advanced Encryption Standard (AES) algorithm has many sources of parallelization, as mentioned in Section 2. In this work, we design a parallel model for the AES algorithm that is divided into two levels. The first level pipelines the different rounds, while the second parallelizes both the Add_Round_Key and the Mix_Column transformations. In this section, the parallel design of the AES algorithm is explained, while its analysis is detailed in the next section.
The Parallel Encryption Model
Based on the AES description in Section 2.1, the AES algorithm is divided into three distinct phases.
The first phase contains the initial round. The second phase contains $N_r - 1$ standard rounds, each of which includes four transformations, namely Byte_Sub, Shift_Row, Mix_Column, and Add_Round_Key. Finally, the third phase contains the final round, which is similar to a standard round except that it does not include the Mix_Column transformation. The Byte_Sub and Shift_Row transformations are executed sequentially because they operate on single bytes, independently of their position in the State matrix. The Mix_Column and Add_Round_Key operations, on the other hand, can be executed in parallel. The Add_Round_Key operation performs a bitwise XOR of the State with the round key, whereas the Mix_Column transformation, which represents the kernel of the AES algorithm and consumes most of the execution time, performs 64 XOR operations and 32 shift operations.
In this work, we design a parallel model for the AES algorithm that is divided into two levels. The first level pipelines the different rounds (from round zero to round 10), while the second parallelizes both the Add_Round_Key and Mix_Column transformations.
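Because Add_Round_Key is a byte-wise XOR, its 16 operations can be divided among several processing elements. The sketch below simulates one possible division in plain Python: the outer loop stands for work that would run concurrently on `m_r` processing elements, and the interleaved assignment of byte indices is an assumption made for this example, not a partitioning prescribed by the design.

```python
def add_round_key_partitioned(state, round_key, m_r):
    """XOR 16 State bytes with 16 round-key bytes, split into m_r independent parts."""
    if len(state) != 16 or len(round_key) != 16:
        raise ValueError("state and round key must both be 16 bytes")
    if not 1 <= m_r <= 16:
        raise ValueError("m_r must be between 1 and 16")
    out = [0] * 16
    for pe in range(m_r):                # each iteration models one processing element
        for i in range(pe, 16, m_r):     # this element handles bytes pe, pe + m_r, ...
            out[i] = state[i] ^ round_key[i]
    return bytes(out)


state = bytes(range(16))
round_key = bytes([0xA5] * 16)
mixed = add_round_key_partitioned(state, round_key, m_r=4)
```

No processing element reads a value written by another, so the parts need no synchronization other than a barrier at the end of the transformation.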
Pipelined Encryption Rounds
As shown in Figure 1, round zero (the initial round) through round $N_r$, with $N_r = 10$, represent the individual rounds of AES-128 encryption. Pipelining these rounds yields a high-performance implementation. The data generated in each round is used as the input to the next round. This is one of the easiest ways to achieve high performance in a very short amount of time, thus shortening the overall design and implementation cycle.
We assume that our system contains eleven stages $\{S_0, S_1, S_2, \ldots, S_9, S_{10}\}$ and that the total number of processors equals M. Each processor has its own local memory; a processor together with its memory is called a processing element (PE). The M processing elements are connected to each other via a multiport Shared Memory (SM), whose content can be accessed through different ports simultaneously. In our work, each stage is performed by $M_r$ PEs, where $M_r = M/11$. Each group of $M_r$ PEs has direct, independent access to a certain memory module, and each PE has a dedicated path to each module in order to achieve better performance. The different stages are connected through a pipelined stream; that is, the pipelined stream contains eleven functions, each executed by a single stage. There is a pipeline stage between consecutive rounds, and the parallelization inside each round is described in Section 3.1.2. Our pipeline design differs from [4] by adding two stages, $S_0$ and $S_{10}$, to the pipeline stream. Each of these stages is used to execute the Add_Round_Key transformation (which consists of sixteen XOR operations), i.e., the design is fully pipelined. This pipelining technique increases concurrency and reduces the total execution time.
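The flow of blocks through the eleven stages can be visualized with a small scheduling sketch. It assumes an idealized pipeline in which every stage takes one time step of equal length, which is a simplification made for this example; the function name `pipeline_schedule` is hypothetical.

```python
STAGES = 11  # S0 .. S10


def pipeline_schedule(num_blocks, stages=STAGES):
    """For each time step, list the block index held by each stage (None if idle)."""
    steps = num_blocks + stages - 1          # fill time plus one step per extra block
    table = []
    for t in range(steps):
        row = []
        for s in range(stages):
            block = t - s                    # block b enters stage s at step b + s
            row.append(block if 0 <= block < num_blocks else None)
        table.append(row)
    return table


for t, row in enumerate(pipeline_schedule(3)):
    print(t, row)
```

Under this idealization, L blocks finish after L + 10 steps instead of the 11 L steps needed when each block must leave the last stage before the next one enters the first.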
Parallelization inside Individual Round
Each individual round consists of four transformations. As mentioned earlier, the Mix_Column and Add_Round_Key transformations can be executed in parallel. Add_Round_Key consists of 16 independent XOR operations and can therefore be executed in parallel. In addition, the Mix_Column transformation consists of 64 XOR operations and 32 shift operations. Mix_Column represents the kernel of the AES algorithm and consumes most of the execution time, which makes a parallel implementation necessary to reduce its execution time. In this section, the mathematical derivation of Mix_Column is discussed in detail. In our design, E represents the matrix used for encryption, while D represents the matrix used for decryption. We assume that $B_i$ and $C_i$ are the input and output of the Mix_Column operation in the case of encryption, and that their roles are reversed in the decryption process. In order to encrypt L data blocks ($1 \le i \le L$), E, $B_i$, and $C_i$ for each block can be represented as follows:
Mix_Column transformation is then represented by the following set of equations and is illustrated in Figure 2 :
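As a hedged illustration, assume that E is the standard MixColumns matrix of FIPS-197 (an assumption, since E is not written out in this text) and write $b_{r,0}$ and $c_{r,0}$ for the elements of the first column of the input and output, with the block index omitted. The first column would then be computed as:

```latex
E =
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}

\begin{aligned}
c_{0,0} &= (02 \cdot b_{0,0}) \oplus (03 \cdot b_{1,0}) \oplus b_{2,0} \oplus b_{3,0} \\
c_{1,0} &= b_{0,0} \oplus (02 \cdot b_{1,0}) \oplus (03 \cdot b_{2,0}) \oplus b_{3,0} \\
c_{2,0} &= b_{0,0} \oplus b_{1,0} \oplus (02 \cdot b_{2,0}) \oplus (03 \cdot b_{3,0}) \\
c_{3,0} &= (03 \cdot b_{0,0}) \oplus b_{1,0} \oplus b_{2,0} \oplus (02 \cdot b_{3,0})
\end{aligned}
```

Here the products are taken in GF(2^8) and $\oplus$ is the bitwise XOR. Writing $03 \cdot x = (02 \cdot x) \oplus x$ and counting each doubling as one shift, every output element costs two shifts and four XOR operations, which agrees with the totals of 32 shifts and 64 XOR operations for the sixteen elements of the State.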
This is repeated for the other three columns of the matrix. The above description shows that the elements of the Mix_Column matrix can be computed independently, so the Mix_Column transformation can be executed by more than one processor. The maximum number of processors is thirty-two in each stage. Figure 3 represents the proposed parallel design for the AES encryption operation, while Figure 4 describes the details of computing each matrix element in parallel. As shown in Figure 4, two processors can cooperate to compute one or more elements $c_{i,j}$.
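The independence of the output elements can be demonstrated with a short Python sketch that computes each of the sixteen elements as a separate task. It assumes the standard FIPS-197 MixColumns matrix for E, and the helper names (`xtime`, `mix_element`, `mix_column_parallel`) are hypothetical. Threads are used only to show that the tasks share no intermediate results; they are not a model of the shared-memory processing elements.

```python
from concurrent.futures import ThreadPoolExecutor

# Standard FIPS-197 MixColumns matrix (assumed to be the matrix E of the text).
E = [[2, 3, 1, 1],
     [1, 2, 3, 1],
     [1, 1, 2, 3],
     [3, 1, 1, 2]]


def xtime(x):
    """Multiply by 02 in GF(2^8): one left shift plus a conditional reduction."""
    x <<= 1
    return (x ^ 0x11B) & 0xFF if x & 0x100 else x


def gf_mul_small(coef, x):
    """Multiply x by a coefficient of E (01, 02, or 03)."""
    if coef == 1:
        return x
    if coef == 2:
        return xtime(x)
    if coef == 3:
        return xtime(x) ^ x
    raise ValueError("coefficient must be 1, 2, or 3")


def mix_element(state, r, c):
    """Compute output element (r, c) from column c of the input State only."""
    acc = 0
    for k in range(4):
        acc ^= gf_mul_small(E[r][k], state[k][c])
    return acc


def mix_column_parallel(state, workers=4):
    """Compute all 16 output elements as independent tasks."""
    coords = [(r, c) for r in range(4) for c in range(4)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(lambda rc: mix_element(state, rc[0], rc[1]), coords))
    return [values[4 * r:4 * r + 4] for r in range(4)]


# Columns of this State: db 13 53 45 | f2 0a 22 5c | 01 01 01 01 | c6 c6 c6 c6
state = [[0xDB, 0xF2, 0x01, 0xC6],
         [0x13, 0x0A, 0x01, 0xC6],
         [0x53, 0x22, 0x01, 0xC6],
         [0x45, 0x5C, 0x01, 0xC6]]
mixed = mix_column_parallel(state)
```

Each task reads four input bytes and writes one output byte, so the sixteen tasks can be assigned to processors in any grouping.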
The Decryption Model
As with the encryption operation, the parallelization of decryption is done in two levels. In the first level of parallelism (pipelining the different rounds), the decryption operation is performed in the same way as the encryption operation (Section 3.1.1). In order to decrypt L data blocks ($1 \le i \le L$), D, $C_i$, and $B_i$ for each block can be represented as follows:
This is repeated for the other three columns of the matrix. As mentioned earlier, both the Inv_Mix_Column and Add_Round_Key transformations can be executed in parallel. The Inv_Mix_Column transformation consists of 160 XOR operations and 192 shift operations. As with the Mix_Column matrix, the elements of the Inv_Mix_Column matrix can be computed independently. The Inv_Mix_Column transformation can be executed by at most 64 processors in each stage. Figure 6 describes the details of computing each matrix element in parallel when using 16 processors. As shown in Figure 6, four processors can cooperate to compute one or more elements $b_{i,j}$. In the next section, the analysis of the proposed design is detailed.
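For the decryption side, a minimal Python sketch of Inv_Mix_Column on a single column is given below. It assumes that D is the standard InvMixColumns matrix of FIPS-197 (an assumption, since D is not written out in this text); the names `gf_mul` and `inv_mix_single_column` are hypothetical.

```python
# Standard FIPS-197 InvMixColumns matrix (assumed to be the matrix D of the text).
D = [[0x0E, 0x0B, 0x0D, 0x09],
     [0x09, 0x0E, 0x0B, 0x0D],
     [0x0D, 0x09, 0x0E, 0x0B],
     [0x0B, 0x0D, 0x09, 0x0E]]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (shift and XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def inv_mix_single_column(column):
    """Apply D to one 4-byte column; each output byte is computed independently."""
    out = []
    for r in range(4):
        acc = 0
        for k in range(4):
            acc ^= gf_mul(column[k], D[r][k])
        out.append(acc)
    return out


recovered = inv_mix_single_column([0x8E, 0x4D, 0xA1, 0xBC])
print([hex(v) for v in recovered])
```

The coefficients 09, 0B, 0D, and 0E each require several doublings, which is why the inverse transformation needs considerably more shift and XOR operations per element than Mix_Column.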
ANALYSIS OF THE PROPOSED PARALLEL AES DESIGN
In this section, for both the encryption and decryption operations, we discuss the mathematical derivation of the proposed parallel AES design on a pipeline architecture of eleven stages. In our design, $M_r$ processing elements cooperate to execute each stage (as discussed in Section 3). For simplicity, we assume block and key sizes of 128 bits.
Encryption Operation
The total sequential time $T_{ES}$ needed to execute the encryption operation is given by:
$$T_{ES} = T_{Add\_Round\_Key} + 9\,T_{N_r-1} + T_{N_r} \tag{13}$$
where $T_{N_r-1}$ is the time of one standard round and $T_{N_r}$ is the time of the final round:
$$T_{N_r-1} = T_{Byte\_Sub} + T_{Shift\_Row} + T_{Mix\_Column} + T_{Add\_Round\_Key} \tag{14}$$
$$T_{N_r} = T_{Byte\_Sub} + T_{Shift\_Row} + T_{Add\_Round\_Key} \tag{15}$$
$$T_{Add\_Round\_Key} = 16\,T_{XOR} \tag{16}$$
$$T_{Shift\_Row} = 48\,T_{shift} \tag{17}$$
$$T_{Mix\_Column} = 16\,(2\,T_{shift} + 4\,T_{XOR}) \tag{18}$$
$$T_{XOR} = 6\,T_{shift} \tag{19}$$
Here $T_{XOR}$ is the time needed to execute one XOR operation and $T_{shift}$ is the time needed to execute one shift operation. From Eqs. (13) to (19), we deduce:
$$T_{ES} = 880\,T_{XOR} + 10\,T_{Byte\_Sub} \tag{20}$$
$T_{Byte\_Sub}$ is very small and can be neglected. For L blocks, the total sequential time is given by:
$$T_{LES} = L\,T_{ES} = 880\,L\,T_{XOR} \tag{21}$$
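The arithmetic behind Eq. (20) can be checked mechanically. The sketch below expresses every time in units of $T_{XOR}$, neglects $T_{Byte\_Sub}$, and assumes that the total is composed of the initial Add_Round_Key, nine standard rounds, and one final round; the variable names are chosen for this example only.

```python
from fractions import Fraction

T_XOR = Fraction(1)                              # unit of time
T_SHIFT = T_XOR / 6                              # Eq. (19): T_XOR = 6 * T_shift
T_ADD_ROUND_KEY = 16 * T_XOR                     # Eq. (16)
T_SHIFT_ROW = 48 * T_SHIFT                       # Eq. (17)
T_MIX_COLUMN = 16 * (2 * T_SHIFT + 4 * T_XOR)    # Eq. (18)

# Eqs. (14) and (15) with T_Byte_Sub neglected.
T_STANDARD_ROUND = T_SHIFT_ROW + T_MIX_COLUMN + T_ADD_ROUND_KEY
T_FINAL_ROUND = T_SHIFT_ROW + T_ADD_ROUND_KEY

# Initial round + nine standard rounds + final round (AES-128).
T_ES = T_ADD_ROUND_KEY + 9 * T_STANDARD_ROUND + T_FINAL_ROUND


def sequential_time(num_blocks):
    """Eq. (21): total sequential time for num_blocks blocks, in units of T_XOR."""
    return num_blocks * T_ES


print(T_ES)                  # total per block, in units of T_XOR
print(sequential_time(40))   # total for L = 40 blocks
```

Exact fractions are used because a standard round costs 280/3 XOR units, which is not an integer; the nine standard rounds together contribute 840 units, and the initial and final rounds add the remaining 40.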
Pipelining the AES encryption rounds
Assuming that the total number of processing elements is $M = 11\,M_r$, and that $M_r$ processing elements are used to execute each stage, the pipeline time $T_{pipeline}$ is given by:
$$T_{pipeline} = L\,t_1 + 9\,t_2 + t_3$$
$T_{Byte\_Sub}$ is very small and can be neglected. From Eqs. (22) to (26), the pipelined time $T_{pipeline}$ is given by:
Parallelization of Add_Round_Key and Mix_Column transformations
We assume that the total number of PEs that compute each round equals $M_r$, where $2 \le M_r \le 32$. Therefore, the time needed to execute the Add_Round_Key transformation is given by:
From Eqs. (37) to (42), and with the assumption that $T_{Inv\_Byte\_Sub}$ is very small and can be neglected, the pipelined time $T_{pipeline}$ is given by:
From the above tables, the following facts can be deduced:
- Tables 1(a) and 3(a) show that pipelining significantly increases the system performance for both encryption and decryption. In addition, the degree of improvement increases with the number of blocks in both cases.
- As shown in Tables 1(b) and 3(b), as the number of processors used to execute each stage increases (from 2 to 16), the improvement degree increases irrespective of the number of blocks. To obtain a reasonable efficiency, we settle for an improvement degree of 98%, which is reached when $M_r = 8$ for encryption and $M_r = 16$ for decryption.
- Tables 2 and 4 show the effect of parallelizing the Add_Round_Key and Mix_Column/Inv_Mix_Column transformations on the system performance inside each stage, in comparison with the case of using only one processor. As the number of processors increases, the total execution time decreases and the speedup increases for both encryption and decryption. Moreover, the improvement degree increases irrespective of the number of blocks; this holds for L = 10, 25, and 40. This leads to the conclusion that the proposed design is scalable and suitable for real-time applications.
Previous work on pipelining the AES algorithm was based on nine stages. In our work, we propose eleven stages in order to exploit the sources of parallelism in both the initial and final rounds. This enhances the system performance compared to previous designs. In addition, we use two levels of parallelism: the first level pipelines the different rounds (from round zero to round 10), while the second parallelizes both the Add_Round_Key and the Mix_Column transformations. Using two levels of parallelization benefits from the high degree of independence of the Mix_Column/Inv_Mix_Column transformations, which leads to better performance.
CONCLUSIONS
The Advanced Encryption Standard (AES) algorithm is a symmetric block cipher that operates on a sequence of blocks, each consisting of 128 bits. The cipher key for the AES algorithm is a sequence of 128, 192, or 256 bits. The AES algorithm has many sources of parallelism. In this work, we proposed an optimized version of the AES algorithm in which both the encryption and the decryption algorithms have been optimized, and we detailed a design for implementing it on a multiprocessor platform. While most previous designs either use pipelined parallelization or take advantage of Mix_Column parallelization, our design combines pipelining of rounds with parallelization of the Mix_Column and Add_Round_Key transformations. The model is divided into two levels: the first pipelines the different rounds, while the second parallelizes both the Add_Round_Key and the Mix_Column transformations. Previous work on pipelining the AES algorithm was based on nine stages, whereas we propose eleven stages in order to exploit the sources of parallelism in both the initial and final rounds. This enhances the system performance compared to previous designs. Using two levels of parallelization benefits from the high degree of independence of the Add_Round_Key and Mix_Column/Inv_Mix_Column transformations. The analysis shows that pipelining significantly increases the degree of improvement for both encryption and decryption, by approximately 95%. Moreover, parallelizing the Add_Round_Key and Mix_Column/Inv_Mix_Column transformations increases the degree of improvement by approximately 98%. To obtain a reasonable efficiency, we settle for an improvement degree of 98%, which can be achieved using eight processors per stage for encryption and sixteen processors per stage for decryption, since increasing the number of processors further decreases the efficiency.
The analysis shows that the improvement degree increases irrespective of the number of blocks; this holds for L = 10, 25, and 40. This leads to the conclusion that the proposed design is scalable and suitable for real-time applications.
