Advanced Encryption Standard (AES) is the most popular symmetric encryption method; it encrypts streams of data using symmetric keys. Current AES architectures of choice employ effective methods to achieve two important goals: protection against power analysis attacks and high throughput. Taking a different architectural point of view, we implement a parallel architecture aimed at the latter goal, which enables more efficient pipelining on a field-programmable gate array (FPGA). To this end, all intermediate registers that would be needed for unrolling the main loop are removed. Instead of unrolling the main loop of the AES algorithm, we realize the pipelining structure by replicating nonpipelined AES architectures and using an auto-assigner mechanism for each AES block. The new pipelined architecture provides two valuable advantages: (a) it solves the single point of failure problem when one of the replicated parts is faulty and (b) it allows the proposed design to be deployed as a fault-tolerant AES architecture. In addition, we emphasize area optimization for all four AES main functions to reduce the overhead associated with AES block replication. The simulation results show that the maximum frequency of our proposed AES architecture is 675.62 MHz, and for AES128 the throughput is 86.5 Gbps, which is 30.9% better than its closest existing competitor.
Introduction
Given the importance of data transmission, especially of large amounts of data, a safe and reliable encryption method is essential to prevent data leakage. Protecting the data is not the only concern: for the transmission of large amounts of private data, achieving high-throughput encryption is also a major requirement. The traditional Data Encryption Standard (DES) and the succeeding Triple DES (TDES), which are older and lighter algorithms than AES, used a 64-bit symmetric key algorithm and were first introduced in Ref. 1. As a result of their short key length, DES and TDES were very vulnerable to some types of attacks, such as brute-force attacks. In 1997, the National Institute of Standards and Technology (NIST) started the process of replacing TDES, and the Rijndael algorithm, selected from among 15 candidate encryption algorithms, became the Advanced Encryption Standard (AES). This algorithm was published as the Federal Information Processing Standard FIPS-197, given in Ref. 2.
In this paper, we design a different AES architecture based on a new pipelining approach that replicates nonpipelined AES blocks. This approach differs from loop unrolling, which is used in other recent AES architectures (Refs. 3-9) to create a chain of all instances of the AES main loop. Instead of establishing such a chain, we replicate the AES main loop in a parallel manner. Accordingly, in our proposed architecture each replicated instance of the AES main loop operates independently: it receives dedicated data (plaintext for encryption, ciphertext for decryption) and encrypts or decrypts it. Therefore, the stream of plaintext/ciphertext must be assigned to the AES blocks in a regular manner, as will be explained later, so that the top-level view of the proposed architecture behaves like a pipelined system. Although improving robustness against power analysis attacks is not the main goal of the proposed architecture, this similarity indicates that our different architectural point of view not only does not jeopardize the efficiency of the proposed AES architecture, but can also strengthen its resistance to side-channel attacks. This improvement is possible because each module employed in the architecture is independent.
Furthermore, we implement a compact SBOX architecture as proposed in Ref. 10, where byte substitution is reimplemented to reduce memory-based substitution. This SBOX architecture is modified to fit the pipelining structure and avoid throughput reduction. Shift-row is implemented with no logic by using a bit reordering method, and mix-column is implemented based on Galois field (GF(2^8)) multiplication to reduce area overhead. Although the proposed design uses parallelism, which replicates instances of the AES main loop and imposes a corresponding area overhead, it occupies only 4,515 slices of an XC7VX690T field-programmable gate array (FPGA), and we achieve 86.5 Gbps throughput.
The rest of the paper is organized as follows. In Sec. 2, a summarized background on AES is given and previous work on the AES algorithm is highlighted. Section 3 presents the proposed architecture, shows the pipelining structure in detail, and explains how the fault-tolerant structure is implemented. Section 4 contains experimental results and analysis. Finally, Sec. 5 concludes the paper.
Background and Related Work
The AES algorithm can operate with 128-, 192-, or 256-bit keys, and the key length is represented by N_k = 4, 6, or 8, the number of 32-bit words in the key. Based on these key lengths, AES iterates N_r = 9(+1), 11(+1), or 13(+1) rounds, and each round, except the additional (+1) round, has four steps: sub-byte, shift-row, mix-column, and add-round-key. The additional (+1) round has three of the four steps, namely sub-byte, shift-row, and add-round-key, while mix-column is omitted. It should be noted that this additional round is the last round of encryption/decryption. Also, before all iterations, there is a single add-round-key step. In an AES calculation flow, after the required keys are expanded based on the length of the input key, the plaintext/ciphertext is processed and the ciphertext/plaintext is produced. Symbolically, the main calculations of the AES algorithm are performed on a 2D array called the state, whose elements are the bytes of the data block. The first step of encryption is sub-byte, which performs a substitution for each byte of the plaintext. The second step is shift-row, which rotates each row of the input state by a specific offset whose value is the index of that row. The third step is mix-column, which applies Galois multiplication to the current state. The last step is add-round-key, which is an exclusive-or of the current state and the corresponding key. These steps are demonstrated in Fig. 1. Additionally, the key expansion contains three functions, i.e., rotation in the first row, substitution in the given row, and an exclusive-or. Similar to the encryption flow, decryption has four steps. These steps are merely the inverses of the encryption steps, and their sequence is permuted. Therefore, there is no critical difference between encryption and decryption. Accordingly, instead of referring to both flows, henceforth we refer only to the encryption flow.
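The two linear steps described above can be illustrated with a minimal Python sketch. It is not the hardware design of this paper: the function names are hypothetical, the state is assumed to be stored as four rows of four bytes, and mix-column is written with the usual FIPS-197 coefficients (2, 3, 1, 1) over GF(2^8).

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # reduce by the AES field polynomial
    return b & 0xFF


def mix_single_column(col):
    """Mix-column for one 4-byte column; 3*a is computed as (2*a) xor a."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def shift_rows(state):
    """Rotate row r of the state (state[r][c]) left by r byte positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def iterated_rounds(nk):
    """Number of four-step rounds for N_k = 4, 6, 8 (the (+1) round is extra)."""
    return nk + 5


state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(shift_rows(state))
print([hex(b) for b in mix_single_column([0xDB, 0x13, 0x53, 0x45])])
print([iterated_rounds(nk) for nk in (4, 6, 8)])
```

Because shift-row is a pure byte permutation, it maps to wiring only in hardware, which is consistent with the "no logic" implementation mentioned in the introduction.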
Implementations of AES can be divided into two main categories: pipelined and nonpipelined structures. If the main target is throughput, the architecture tends toward a pipelined structure, and techniques like loop unrolling are employed. Otherwise, for area-constrained architectures, a nonpipelined, sequential structure is preferred to minimize area occupation. Consequently, there are differences in the performance of pipelined and nonpipelined methods, as discussed in Ref. 11, which concludes that the throughput can be approximately 90% better in a pipelined structure, but at the cost of a 90% area overhead.
In Ref. 3, a two-stage pipeline was implemented which achieved 66.1 Gbps throughput. The focus of that work is on minimizing the number of combinational logic levels. The authors characterized their implementation for both standard four-LUT and six-LUT slices in FPGAs. Even though they assessed routing delay in FPGAs to minimize the path delay, they abstained from illustrating placement and routing results, which are required for exact assignment. They also implemented and analyzed the AES design on different FPGAs and achieved up to 66.1 Gbps throughput. A drawback of this design is its lookup-table-based implementation of sub-byte, since sub-byte has the largest logic depth among all AES sub-modules (five levels for 4-LUT FPGAs and three levels for 6-LUT FPGAs). In Ref. 12, three pipeline stages were used, targeting high throughput. That work assessed byte substitution with two different solutions. The first solution uses a lookup table, and the second one is based on computation in the Galois field by multiplicative inversion and an affine transformation.
Most of the earlier AES implementations which use a pipelined structure have similar structures and differ only in minor details. For instance, another pipelined AES implementation was discussed in Ref. 7; it is similar to the overall architecture of Ref. 12, implementing a three-stage pipeline with some modifications in the combinational logic.
Some implementations, such as Ref. 4, try to reduce the area by mapping GF(2^8) to GF(2^4) and also protect AES against power attacks. Their main focus is on storage area networks (SANs), and they try to prevent differential power analysis (DPA) attacks and glitch attacks. As we mentioned earlier, one goal of current AES implementations is protection against power analysis attacks. Most of the corresponding implementations have concentrated on masking the sub-byte module, as discussed in Refs. 4 and 12-14. In Ref. 4, a mapping from GF(2^8) to GF(2^4) occurs before all AES main functions, and after the encryption/decryption is completed, the ciphertext is remapped from GF(2^4) back to GF(2^8). A drawback of this design is the overhead associated with the conversion between GF(2^4) and GF(2^8), which reduces throughput and increases area overhead compared to similar architectures. Also, to reduce the area overhead caused by the field conversion, sub-byte has been implemented using BRAM, which can reduce the occupied area resources. In Ref. 9, a parallel-pipelined AES architecture is designed for 100G Ethernet applications. In that architecture, both parallelism and pipelining are employed to achieve higher throughput. Therefore, both loop unrolling (for pipelining) and replication of AES main functions (for parallelism) are used. In contrast, we use parallelism (replication of AES main functions) as a tool to realize the pipelining structure, rather than using pipelining and parallelism as two synergetic methods to achieve better throughput.
Although our focus in this paper is to achieve a high-throughput system, we also try to design a system with the lowest area overhead. For this purpose, some modules, like sub-byte, which has more delay than other modules, are implemented with more emphasis on performance. On the other hand, we try to minimize area occupation for modules like mix-column and shift-row, which are simple and have adequate throughput. We implement a fast composite-field sub-byte adopted from Ref. 10 to reduce the highest critical path delay in our proposed architecture and maximize the throughput. Additionally, based on our new approach to implementing the pipelining structure, since all modules are independent, we can remove the single point of failure problem. In fact, if some parts of AES do not work correctly, other parts can continue the AES operations, which leads to a fault-tolerant AES architecture.
Proposed Replicated AES Architecture
All existing and common pipelined designs use loop unrolling to expand all N_r = 9(+1), 11(+1), or 13(+1) rounds, and add registers between rounds, as demonstrated in Refs. 3, 7, and 12. These registers establish a pipelined data path to encrypt plaintext or decrypt ciphertext. In contrast, as we mentioned before, our proposed design realizes the essence of the pipelining structure without using loop unrolling or intermediate registers. Instead of loop unrolling, we replicate each round of AES, such that all replicas can operate in parallel and simultaneously.
The main difference between loop unrolling and our replication method is the task assigned to their modules. Loop unrolling contains all iterations, and each iteration operates as a deterministic and fixed step of encryption/decryption over all plaintexts/ciphertexts. Replication, on the other hand, contains instances of the encryption/decryption module, and each instance encrypts/decrypts a set of plaintexts/ciphertexts. In other words, unlike loop unrolling, all instances in the replication method are independent, and each instance can perform encryption/decryption in its entirety. As we can see in Fig. 2(a), each encryptstep is an instance of the corresponding replication, which has no dependency on other encryptsteps. So, each instance can operate independently. In contrast, in the loop unrolling technique, all encryptsteps are chained via intermediate registers, which makes all encryptsteps dependent on one another.
In order to fulfill pipelining functionality in the proposed method, the number of replications should be greater than or equal to N_r. For instance, in AES128, which has N_r = 9(+1) rounds, we need at least nine replications. The overall architecture of our proposed AES is illustrated in Fig. 2(a). As we can see in Fig. 2(a), each replication of AES is named an encryptstep, and each encryptstep has four sub-modules, i.e., sub-byte, shift-row, mix-column, and add-round-key, as illustrated in Fig. 2(b).
For a complete encryption/decryption scenario, first of all we have an add-round-key module which operates as an exclusive-or over the plaintext and the incoming key. Afterward, a loop with N_r iterations of the four sub-modules, called the iteration phase, is executed, and finally an additional round with three of the four sub-modules is executed as the last step. This last step is similar to Fig. 2(b), with only mix-column omitted. In all existing pipeline-based AES architectures, these steps are chained by injecting intermediate registers. To realize pipelining in our design, however, we need to inject all plaintexts in a sequential order. Additionally, while a specific encryptstep with its dedicated data is in the iteration phase, it cannot start a new iteration phase. So we need to wait for at least nine encryption cycles until all iterations for this dedicated plaintext are done. Therefore, to achieve the pipelining structure in AES128, we need to replicate the encryptstep nine times, and it should be replicated 11 and 13 times for AES192 and AES256, respectively.
As was mentioned earlier, to realize pipelining in our architecture, we should inject the stream of data with a specific and regular sequential behavior. In encryption mode, we first inject a plaintext into the first encryptstep. In the first encryption cycle, the first round of AES encryption is applied to the injected plaintext in encryptstep 1. Afterward, in the second encryption cycle, we inject the second plaintext into encryptstep 2. In this cycle, the second round of AES is applied to the first injected plaintext in encryptstep 1. In the third encryption cycle, we inject the third plaintext into encryptstep 3. Similarly, in this cycle, the third round of AES is applied to the first injected plaintext in encryptstep 1, and the second round of AES is applied to the second injected plaintext in encryptstep 2. This injection procedure continues up to the ninth encryptstep. When new data is injected into the ninth encryptstep, all rounds of AES encryption in encryptstep 1, eight rounds in encryptstep 2, and so on down to the first round of AES encryption in encryptstep 8 have already been accomplished. So, the calculated data of encryptstep 1 is ready for the last (+1) round of AES, which is shared between all steps. After this cycle, the ciphertext of the first injected plaintext is ready. At this moment, all iteration rounds of AES encryption in encryptstep 2 are done, so the calculated data of encryptstep 2 is ready for the shared last (+1) round. Also, a new plaintext is injected into encryptstep 1 at this moment. This specific and regular sequential structure in the proposed architecture resembles a systolic array architecture. All encryptsteps are independent, and each module has dedicated data in each period of time. Figure 3 shows the timing diagram and the overall behavior of our proposed architecture. As we can see in Fig. 3, the pipelining structure is realized in the proposed system.
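The injection order described above can be checked with a small cycle-level model. The following Python sketch is only an illustration under stated assumptions: the function `simulate` and its parameters are hypothetical, one encryption cycle is treated as one time unit, and the shared (+1) round is assumed to take the cycle right after an encryptstep finishes its nine iterations.

```python
def simulate(num_blocks, n_steps=9, n_rounds=9):
    """Round-robin injection of plaintext blocks into independent encryptsteps.

    Returns a list of (block, encryptstep, inject_cycle, final_round_cycle).
    An encryptstep is busy for n_rounds cycles after injection; if it is still
    busy when its turn comes, the injection stalls until it is free.
    """
    free_at = [1] * n_steps  # first cycle in which each encryptstep is free
    timeline = []
    cycle = 1
    for block in range(num_blocks):
        step = block % n_steps
        inject = max(cycle, free_at[step])
        free_at[step] = inject + n_rounds
        final_round_cycle = inject + n_rounds  # shared (+1) round
        timeline.append((block, step, inject, final_round_cycle))
        cycle = inject + 1
    return timeline


for entry in simulate(12):
    print(entry)
```

With nine encryptsteps and nine iterated rounds, one block is injected and one block reaches the shared last round in every cycle, which is the pipelined behavior described in the text. Calling `simulate(6, n_steps=5)` instead shows the sixth block stalling until cycle 10, which illustrates why the number of replications should not be smaller than N_r.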
As we mentioned, the sub-byte module is the main obstacle to achieving high throughput in our proposed architecture. Therefore, we implement a composite-field-based sub-byte as proposed in Ref. 10. The main task in composite-field computation is multiplicative inversion. By utilizing a sub-section of the composite-field implementation adopted from Ref. 10, we achieve a considerable improvement in critical path delay.
Additionally, as can be seen in Fig. 2(b), we inject an extra intermediate register between the sub-byte and shift-row modules. Since sub-byte has the largest logic depth among the AES sub-modules, as shown in Table 1, we split the critical path in the encryptstep by adding this extra register. By doing so, each encryption cycle is executed in two clock cycles. Therefore, the number of encryptstep instantiations should be doubled. Accordingly, as will be illustrated in the experimental results, our proposed architecture requires more resources but achieves better performance.
Inserting an intermediate register in the encryptstep shown in Fig. 2(b) approximately halves the critical path delay and thus roughly doubles the performance of our proposed architecture. To compensate for the resulting area overhead without restricting performance, we use the low-area sub-byte depicted in Fig. 4. This sub-byte is implemented by decomposing GF(2^8) into GF((2^4)^2) with an additional isomorphic mapping and its inverse, similar to Refs. 10 and 15. It should also be noted that, from an analytical point of view, our proposed architecture shows no degradation against power analysis attacks like DPA attacks. First of all, as mentioned, the primary goal of our proposed AES architecture is performance efficiency. Therefore, the presented architectural point of view should at least maintain the robustness against side-channel attacks. Since the structure of each encryptstep is exactly equivalent to the baseline AES engine, similar to Refs. 2, 3, and 6, it can be concluded that robustness is not degraded in the proposed architecture. As a second piece of evidence, a masked sub-byte is an existing solution that addresses both performance and robustness against side-channel attacks, as illustrated in Ref. 4. Consequently, not only do we have the same behavior as the baseline AES engine based on the first reason, but we can also achieve effectiveness against side-channel attacks.
Another feature implemented in our proposed architecture is the capability of detecting and correcting a faulty encryptstep, which prevents both performance degradation and single point of failure problems. As mentioned before, the major difference between our proposed architecture and all existing pipelined architectures is the independence of all replicated modules, i.e., data in each replicated module has no dependency on that of other modules. Therefore, we can use a fault detection method to check which encryptstep computation is faulty, and based on the response, we can replace the faulty encryptstep with a fault-free one. So, we can eliminate all stalls and delays caused by the faulty encryptstep.
We use a low-cost four-step concurrent fault detection method in each AES round, adopted from Ref. 16, which is based on a parity checking methodology. Generally, the method includes one step for computing the parity of the input and one for checking the parity of the output. The two remaining steps are parity modifications in two AES sub-modules: sub-byte and add-round-key. In a nutshell, whenever data is substantially changed, its parity must be updated. In the AES sub-modules, data is changed in this sense only in sub-byte and add-round-key; no parity modification is needed in the mix-column and shift-row modules, because shift-row is a permutation of bytes and mix-column is a constant multiplication in the Galois field. Accordingly, we compare the parity of the input, after these modifications, with the parity of the output of the AES computation. The overall architecture of this detection module is illustrated in Fig. 5.
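To make the parity argument concrete, here is a minimal Python sketch. It assumes a single even-parity bit over the whole state, which may be coarser than the scheme of Ref. 16; all function names are hypothetical, and parity prediction through sub-byte is deliberately left out. The mix-column input/output pair is a commonly used AES test column, hard-coded rather than computed.

```python
def parity(data):
    """Single parity bit over all bits of a sequence of bytes."""
    acc = 0
    for byte in data:
        acc ^= byte
    return bin(acc).count("1") & 1


def add_round_key(state, key):
    """Bytewise exclusive-or of state and round key."""
    return [s ^ k for s, k in zip(state, key)]


def predicted_parity_after_add_round_key(state_parity, key):
    """XOR is linear, so the output parity is the input parity xor the key parity."""
    return state_parity ^ parity(key)


def fault_detected(predicted, output):
    """Compare the predicted parity with the parity actually observed."""
    return predicted != parity(output)


state = list(range(16))
key = list(range(100, 116))
out = add_round_key(state, key)
predicted = predicted_parity_after_add_round_key(parity(state), key)
print(fault_detected(predicted, out))      # fault-free computation

corrupted = list(out)
corrupted[5] ^= 0x10                       # inject a single-bit fault
print(fault_detected(predicted, corrupted))

# Shift-row only permutes bytes, so the parity bit cannot change.
print(parity(state[::-1]) == parity(state))

# Mix-column: the column coefficients 2, 3, 1, 1 xor to 1, so the xor of the
# four output bytes equals the xor of the four input bytes.
col_in, col_out = [0xDB, 0x13, 0x53, 0x45], [0x8E, 0x4D, 0xA1, 0xBC]
print(parity(col_in) == parity(col_out))
```

A single parity bit only catches faults that flip an odd number of bits, which is the usual limitation of parity-based detection.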
Based on this fault detection module, we add a fault-aware controller to assign input data according to correctness of each encryptstep. In fact, this controller can check the situation of each encryptstep, and if the current encryptstep, which should be loaded with new plaintext, detects a fault based on the parity checking methodology, it will be loaded with garbage data instead of input data.
Moreover, using this scheme does not interrupt pipelined operation. We can reserve some encryptsteps by instantiating more of them than the required number (2 × N_r). These reserved fault-free modules are used to replace faulty modules. The number of faulty modules might exceed the number of reserved modules, in which case performance degradation should be expected, although this case is unlikely. Therefore, as long as enough reserved modules remain, our proposed design has neither performance degradation nor any single point of failure problem when it encounters a faulty encryptstep.
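The effect of reserved encryptsteps can be sketched with a small model of the fault-aware assignment. This Python sketch reflects one possible reading of the controller, not its actual implementation: the function names are hypothetical, faulty encryptsteps are simply skipped in the round-robin order, and the model counts encryption cycles with nine iterated rounds, ignoring the doubling caused by the extra register.

```python
def schedule_with_faults(num_blocks, n_steps, faulty, n_rounds=9):
    """Assign blocks round-robin to fault-free encryptsteps only.

    Returns a list of (block, encryptstep, inject_cycle).
    """
    healthy = [s for s in range(n_steps) if s not in faulty]
    if not healthy:
        raise ValueError("no fault-free encryptstep left")
    free_at = {s: 1 for s in healthy}
    schedule = []
    cycle = 1
    for block in range(num_blocks):
        step = healthy[block % len(healthy)]
        inject = max(cycle, free_at[step])  # wait if the step is still busy
        free_at[step] = inject + n_rounds
        schedule.append((block, step, inject))
        cycle = inject + 1
    return schedule


def stall_cycles(schedule):
    """Total number of cycles in which no block could be injected."""
    return sum(b[2] - a[2] - 1 for a, b in zip(schedule, schedule[1:]))


# Nine required encryptsteps plus two reserved ones.
print(stall_cycles(schedule_with_faults(40, 11, faulty={3, 7})))
print(stall_cycles(schedule_with_faults(40, 11, faulty={3, 7, 8})))
```

In this model, two faults are absorbed by the two reserved encryptsteps without stalls, while a third fault leaves only eight usable encryptsteps and stalls begin to appear, matching the degradation case noted in the text.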
However, in previous AES designs, when a part of the AES calculation became faulty, it had to be executed again because each part of the calculation depends on another part, as can be seen in Refs. 17 and 20. In the best case, the activated 
Implementation Results
We have implemented the architecture presented in this paper in Verilog HDL, and evaluated the results reported by Xilinx ISE 14.6 for different FPGA devices. Since we focus on throughput, the results in Table 2 are compared to the best previous designs to show the efficiency of our design against its competitors. It should be noted that Table 2 contains a column named slowest FFs to FFs (placement and routing (PAR) phase), which is a static timing analysis parameter calculated by the Xilinx ISE design tool. With it, we can extract the most critical path after the PAR phase to estimate the maximum possible frequency of our proposed design. As its name implies, this parameter estimates the maximum delay between all FFs, which can be used to calculate the maximum possible frequency of a design implemented on the targeted FPGA device.
It is worth mentioning that, from an architectural point of view, there are two general classifications in AES design, i.e., high-throughput and area-efficient architectures. Therefore, there are two basic parameters for evaluating the experimental results. The first one is the maximum possible frequency, which is calculated from the most critical path. The second one is the resource consumption of the architecture, which is specified by the LUTs and FFs utilized in the FPGA device. Similar to Ref. 3, we evaluate both parameters to show that our design not only can be considered an area-efficient architecture, but also achieves a considerable throughput. Unlike previous pipelined AES designs, we removed the extra injected intermediate registers and used a simple controller, composed of a counter, to assign plaintext according to the described behavior of the replicated sub-modules. By doing so, we establish a better balance between consumed LUTs and consumed FFs. As we know, all FPGA devices consist of a huge number of slices, and all slices consist of a predefined LUT and FF pair. In most cases, in each … So, the PAR phase of implementation cannot assign a close-to-optimum location for the required slices, which leaves a significant proportion of consumed slices not fully utilized. As a result, in a considerable number of slices, either LUTs or FFs are left unused. Therefore, the efficiency of implementing AES may decrease. In contrast, in our proposed architecture, most of the slices are fully utilized by employing both LUTs and FFs, which results in less wastage of FPGA resources. This yields a close-to-optimum area design. Table 3 shows the exact maximum frequency of our proposed design after the PAR phase. We can conclude two main features from our experimental results. First, we achieve a higher throughput than Ref. 3 thanks to our new approach to implementing pipelining, with a similar performance-cost ratio. 
Another feature which illustrates the higher efficiency of our proposed architecture is its improvement after the PAR phase compared with other designs. As we know, the PAR phase constrains our design to place the consumed slices on specific parts of the FPGA. Because we establish a well-balanced architecture between FFs and LUTs, the mapping phase and the subsequent PAR phase have lower complexity. Therefore, both phases cause less degradation than in competing designs.
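As a quick consistency check of the headline figures, the reported throughput follows from the maximum frequency if one assumes that the design accepts one 128-bit block per clock cycle. The comparison below further assumes that the closest competitor is the 66.1 Gbps design of Ref. 3; both assumptions, and the function name, are ours.

```python
def throughput_gbps(f_max_mhz, block_bits=128):
    """Throughput in Gbps, assuming one block enters the design per clock cycle."""
    return f_max_mhz * 1e6 * block_bits / 1e9


ours = throughput_gbps(675.62)                 # reported maximum frequency
print(round(ours, 1))                          # throughput in Gbps

competitor = 66.1                              # Ref. 3 (assumed closest competitor)
improvement = (round(ours, 1) / competitor - 1) * 100
print(round(improvement, 1))                   # improvement in percent
```

Under these assumptions the numbers agree with the 86.5 Gbps throughput and the 30.9% improvement reported in this paper.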
Conclusion
The essential need for secure transmission of large amounts of data encourages researchers to concentrate on high-throughput AES implementations. The architecture presented in this paper is a high-throughput FPGA-based AES, which is pipelined using a different approach derived from a particular behavior of a parallel structure. Thanks to this difference, we achieve 86.5 Gbps throughput, which is 30.9% better than the closest existing competitor. Also, the independence of the AES sub-modules, which results from our different pipelining architecture, enables us to add a control structure that implements a concurrent and lightweight fault-tolerant mechanism for our proposed architecture.
