Introduction
A variety of electronic products such as mobile phones, PDAs, smart cards and laptops are used to store, access, manipulate and communicate sensitive data [1] . It seems that protection of the processed data has become more and more important. In most cases, we can use different security protocols by employing different cryptographic algorithms to achieve these ends.
Cryptography algorithms refer to the building blocks used to guarantee the data security. Because of the intensive nature of these algorithms, high-performance is particularly important; at meanwhile, we usually adopt more than one algorithm in the real cryptographic application. It's necessary to change the algorithm at real time [2] .
Coarse-Grained Reconfigurable Arrays (CGRAs) can provide a solution to keep highperformance and flexibility at the same time. We design such a processor to meet these demands and Fig. 1 gives its architecture. It mainly consists of Compute Engine (CE) and Configuration Control (CC). When the processor works, there are primarily three flows: the configuration flow, the control flow and the data flow. The configuration information is sent to Configuration Memory (CM) as soon as the system is powered on. Reconfigurable Cell Array (RCA) is reconfigured and the data who comes from FIFO is calculated afterwards. When an algorithm ends, CC will transform RCA to different structure so that a new one can be implemented. We have mapped a lot of cryptographic algorithms thereon, such as AES [3] , SM3 [4] , SM4 [5] and ZUC [6] . Of all these algorithms, AES is a typical block cipher and it's widely used in the network and financial system. For that reason, we map the AES algorithm on the Crypto-Processor and describe the optimization of the algorithm implementation in step by step manner. The results we obtained are compared with the implementations on the FPGAs. We will show that our mapping results can reach better efficiency.
Algorithm Description
The Advanced Encryption Standard, is a replacement for the Data Encryption Standard (DES). Because of insecurity of the S-Box, NIST decided to give up DES and standardize AES in May 2002.
AES is an iterated cryptographic block cypher which has a variable block length of 128, 192 or 256-bits. The number of rounds to be executed in encryption or decryption operation, depends on the block length. When the key length is 128, 196 or 256-bits, the number is 10, 12 or 14.
In this work, we focus the implementation of 128-bits AES algorithm on CGRP. It performs block cipher encryption and decryption in Electronic Codebook (ECB) mode. In the encryption process, plaintext adds the initial key, and then goes through 10 rounds function. There is no Mix Columns in Round 10, which is different from Round1-9. The decryption processes in an opposite way.
Algorithm Mapping
AES algorithm can be divided into two main stages: Round-Key Generation and Round Operation, both of which are mapped to the Reconfigurable Crypto-Processor in a certain way.
The Crypto-Processor we designed mainly consists of 16 lines with each line as a Programmable Entity (PE). In Fig. 2 , we can see the architecture of the PE. We can swap any bit of the 256-bits data with the BENES networks. There are 32 8-bits ALUs in each PE. The four adjacent rows share one Look-up Table; therefore, we can process the SubBytes, ShiftRows, lookup tables (LUTs) or logical operations in each line. Some special operation is designed and used in the ALU. In order to reduce the use of LUTs, our models support the calculation in a finite field such as X2 operation which means {02}·A in GF (2 8 ). Every PE has three input ports and three output ports. The data source can be General Purpose Register File (GPRF), FIFO or the last line. The data upon computation is then sent to GPRF, FIFO or the next line. In this way, the data can be stored, exported to the processor or transferred to the PE in next line. This design can enhance the flexibility of algorithm mapping; in this sense, we can get a great ascension on performance. 
Key Expansion
The length of key in this paper is 128-bits. We assume that the original Round-Key is W 3 W 2 W 1 W 0 . Fig. 3 gives the flow path of Round-Key Generation. In this mapping method, we get a new 128-bits Key with 6 PEs.
The generation of key W 7 W 6 W 5 W 4 is shown as follows:
PoS(CENet2015)066
An Efficient Implementation of Advanced-Encryption-Standard Le Chang
(3.4) According to the algorithm described in Fig. 3(a) , we need to map the function F with 3 PEs. Afterwards, the generation of W 5 takes another PE. In General, 6PEs are the minimum size to calculate a new key from the old one. The new Round-Key is stored in the GPRF.
Round Operation
The round operation is the mainly time consuming part of AES algorithm and it will be remapped into the RCA upon completion of the key expansion. Fig. 3(b) shows the mapping of round operation.
It also needs 6 PEs to accomplish the mapping per round. It's obvious to implement the Sub-Bytes, ShiftRows or AddRoundKey with 1 PE. In the MixColumns, there is a transform equation:
(3.6) {02}·A is an operation in GF (2 8 ). It needs 1 PE to finish the transformation. To calculate the result of S 0j S ⊕ 1j , 1 PE is used. The 3-input XOR operation in the end takes advantage of another PE. In the final round, 2 PEs are enough to implement it because there is no MixColumns.
The Round Operation is the main computation process in AES, and our RCA has 16 lines, that is, we can only proceed two-round operation at a time. Its performance is poor and the use ratio of the RCA is low.
Mapping Optimization
In our mapping scheme mentioned in Section 3, the efficiency of AES may be not optimal. It's noticed that the bandwidth of each PE is 256-bits, and we make use of half at most occasions. There is a possibility that we can improve the efficiency by analyzing the critical path and increasing parallelism.
Optimization of Key Expansion
In the key expansion, we can transform Eq.1-Eq.4 like this:
It's obvious that the SubBytes in S-box is the critical path. Therefore we calculate B, C and D in advance. In this way, we get A, B, C and D after first PE. The values of B ⊕W 1 , B⊕C and B⊕D are calculated in the second PE and the look-up table results of A come out in the same time. After the third PE, we get a new 128-bits key. In general, we halve the cycles from 6 to 3.
Optimization of Round Operation
In the round operation, the result after AddRoundKey is like this:
(4.6) Equation (4.6) shows that the critical path of round operation is X2 operation in GF (2 8 ); so we divide the MixColumns into three parts to recombine them with less PEs. In the first part we calculate E and F; at meanwhile, it can process ShiftRows. In the second part, we get the X2(E) and G. It means that we add the round key in advance. The last part we'll see the final result of current round. Upon optimization, we reduce the number of PE from 6 to 4.
Simulation Results and Comparison
In this section we compare the performance of AES running on our CryptoReconfigurable-Processor and on several FPGAs [7, 8] . Our design is verified under SMIC 55 nm CMOS. We implement the optimized AES mapping thereon practically and obtain the performance. Table 1 According to the Table 1 , we can see that it achieves higher area efficiency than FPGA implementations. Upon optimization, the throughput of AES can increase twice. At the same time, we implement the SM3, SM4 and ZUC algorithms in an acceptable performance. The SM3 implementation can achieve 216Mbps, SM4 reaches 2.34Gbps and ZUC comes to 1.2Gbps; thus we can switch the algorithms in real-time easily with high throughput under the practical working condition.
Conclusion
To sum up, this paper proposed a new implementation of AES based on reconfigurable Crypto-Processor. The performance of this implementation is higher than FPGA. At the same time, the processor's more flexible because it can support the real-time reconfiguration. Up to now, we have implemented AES, SM3, SM4, ZUC and SHA-256. We will implement other algorithms and improve the architecture in future.
