Abstract-In this paper, we propose a dynamically run-time reconfigurable, power-aware cryptographic processor for secure autonomous encryption. The design implements a dynamically reconfigurable AES cryptographic processor on an FPGA. It encompasses a microarchitecture that can be optimized for power, latency, or throughput via hardware acceleration and partial reconfiguration, managed by a multi-level autonomic controller and a data router, to enable tradeoffs under changing operational requirements within resource constraints. The multi-level controller selects the appropriate configuration based on varying operational workloads, characterizing the effect that time-varying task parameters have on the hardware architecture, to enable a run-time tradeoff between performance and resource usage (key length, computational efficiency, latency, and throughput).
I. INTRODUCTION
The rapid growth of portable electronic devices with limited power and physical size has opened a vast area of low-power and compact circuit design opportunities and challenges for VLSI circuit designers. Cellular phones, PDAs, and smart cards are examples of portable electronic products that are becoming an integral part of everyday life. The popularity of these devices introduces a new class of portable applications that are power aware and dynamically adapt to task needs in resource-constrained, hostile environments where secure collection, processing, and transmission of time-critical sensitive data is a necessity. This class of portable applications requires a lightweight advanced cryptographic system, such as a Trusted Platform Module [10], [11], [12], [13], implemented on board the mobile device to assure data integrity, point-to-point communication, authentication [14], [15], confidentiality, and nonrepudiation. Furthermore, run-time adaptation to task dynamics in resource-constrained environments demands that these mobile devices include critical in situ reasoning and collaborative decision-making capabilities.
U.S. Government work not protected by U.S. copyright
Innovative, dynamically reconfigurable cryptographic architectures are needed to perform basic tradeoffs among speed, security, throughput, and power in these environments.
Current NSA Type 1 cryptographic devices, known as High Assurance Internet Protocol Encryptors (HAIPE), are unsuitable for on-board encryption in the mobile device, due to their limited computational capability and power. In this paper, we propose a dynamically run-time reconfigurable, power-aware AES cryptographic coprocessor to ensure secure communications in small, power-limited mobile devices. The proposed design is capable of assessing the current needs of a task, at run time, with respect to power, throughput, latency, and other information assurance variables. After this assessment, the design deploys the necessary AES cryptographic configurations accordingly.
This paper is organized as follows: an overview of the AES algorithm is presented in Section II. Section III introduces the system architecture for the proposed FPGA-based AES coprocessor. Section IV describes the implementation details of the reconfigurable AES engine, followed by an overview of the Vortex data router [15, 16, 17, 18] in Section V. Section VI describes the resource performance controller of our design. Workflow simulations are presented in Section VII, followed by performance validation and tradeoff analysis in Section VIII. Section IX presents related work, and Section X concludes the paper.
II. AES OVERVIEW
The Advanced Encryption Standard (AES) [1] is a symmetric block cipher that processes data in 128-bit blocks and supports key sizes of 128, 192, and 256 bits; these variants are referred to as AES-128, AES-192, and AES-256, respectively. AES uses 10, 12, or 14 rounds, depending on the key size. The block state in AES is maintained as an array of 16 bytes, organized as a 4x4 matrix. Initially, the state is filled column by column from the input block. The state is transformed in Nr rounds into a final state, which is then read out, column by column, as the output block. For encryption, all but the final round consist of an identical sequence of operations on the current block state. The final round differs in that it omits the MixColumns operation.
978-1-5090-5252-3/16/$31.00 ©2016 IEEE
• Each column of the state has MixColumns applied to it
Figure 1 shows the basic structure of AES.
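The relationship between key size, round count, and the column-by-column state layout described in this section can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard FIPS-197 parameters (Nk key words, Nr = Nk + 6 rounds); the function names `aes_params`, `block_to_state`, and `state_to_block` are hypothetical and are not part of the proposed hardware design.

```python
# Illustrative sketch only; helper names are hypothetical.
NB = 4  # number of 32-bit columns in the AES state (fixed by the standard)


def aes_params(key_bits):
    """Return (Nk, Nr) for a supported AES key size in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES supports 128-, 192-, and 256-bit keys only")
    nk = key_bits // 32   # number of 32-bit words in the key
    nr = nk + 6           # 10, 12, or 14 rounds
    return nk, nr


def block_to_state(block):
    """Fill the 4x4 state column by column: state[r][c] = block[r + 4*c]."""
    if len(block) != 4 * NB:
        raise ValueError("AES blocks are 16 bytes")
    return [[block[r + 4 * c] for c in range(NB)] for r in range(4)]


def state_to_block(state):
    """Read the state back out column by column."""
    return bytes(state[r][c] for c in range(NB) for r in range(4))
```

For example, `aes_params(256)` gives eight key words and fourteen rounds, and `state_to_block(block_to_state(b))` returns `b` unchanged for any 16-byte block.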
III. SYSTEM ARCHITECTURE FOR THE AES COPROCESSOR
Previous AES implementations using ASICs or FPGAs have primarily focused on achieving higher throughput within a smaller chip area. As such, their operation is indifferent to the potentially changing operational requirements of dynamic security tasks and their collaboration needs. In this section, we describe our proposed system architecture for a dynamically reconfigurable AES cryptographic processor, which is illustrated in Fig. 2. Our cryptographic architecture comprises several components: an AES engine, a multi-level resource optimization controller, and the Vortex data router [15]. Data encryption and decryption are handled by the AES engine, which is capable of supporting multiple levels of security. The multi-level controller responds to task requests and drives dynamic on-chip configuration, which enables run-time tradeoffs among security and performance parameters (key length, throughput, latency, and power consumption) in resource-constrained environments. Using the Vortex data router, we propose constructing the cryptographic coprocessor as a set of distributed AES cores interconnected through the robust network-on-chip communication infrastructure. The proposed architecture will be capable of dynamically assessing the current needs of a secure task with respect to chip area, security, and speed.
In addition to adapting many previous throughput-maximizing techniques in our architecture, we have exploited another innovation that further increases throughput. The new design explores the use of partial reconfiguration, which allows run-time swapping of AES hardware modules to achieve a near-optimal configuration. Each AES core features a set of predetermined performance parameters (e.g., throughput, latency, area, power consumption). Multiple AES cores are dynamically instantiated to address the performance needs of a specific task.
IV. AES CRYPTOGRAPHIC ENGINE
The AES engine proposed in this paper is designed to provide a single unified hardware core that is parameterizable at synthesis to allow tradeoffs in throughput, area, latency, and power consumption. The AES core supports various security levels by deploying different key sizes: 128 bits, 192 bits, and 256 bits. The AES engine was implemented as a set of five primary components, corresponding to the core operations of the AES algorithm: Key Expansion, AddRoundKey, ShiftRows, SubBytes, and MixColumns.
A. Key Expansion Module:
The Key Expansion module is implemented as an iterative process. There is no flexibility for parallelizing this step, as each expanded key word W[i] depends at least in part on the previously expanded key words W[i-1] and W[i-Nk], where Nk is the number of 32-bit key words and is determined by the key size. As in the algorithm specification, our key expansion implementation is identical for AES-128 and AES-192, with only the necessary minor modifications for AES-256, as illustrated in Figure 3.
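The dependency of each key word on W[i-1] and W[i-Nk], and the extra step needed for AES-256, can be seen in a software sketch of the standard FIPS-197 key schedule. This is a reference model in Python, not the hardware module described here; the helper names (`gf_mul`, `build_sbox`, `expand_key`) are hypothetical, and the S-box is computed from its algebraic definition rather than stored as a table.

```python
# Software reference sketch of FIPS-197 key expansion (not the FPGA module).

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing modulo 0x11b."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF  # rotate left by k
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(word):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def rot_word(word):
    """Rotate a 32-bit word left by one byte."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Return the list of 4*(Nr+1) expanded 32-bit key words."""
    if len(key) not in (16, 24, 32):
        raise ValueError("key must be 16, 24, or 32 bytes")
    nk = len(key) // 4
    nr = nk + 6
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]                      # dependency on W[i-1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)            # extra step used only by AES-256
        w.append(w[i - nk] ^ temp)           # dependency on W[i-Nk]
    return w
```

Because every iteration reads the word produced by the previous one, the loop cannot be parallelized across `i`, which matches the observation above.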
B. AddRoundKey Module:
The AddRoundKey module performs the XOR of the AES state with the round key. In our design, this module operates in one of three modes: BYTE, COLUMN, and STATE. In BYTE mode, an 8-bit XOR operation is applied to one byte of the state in each cycle; thus, a total of approximately 17 cycles are required to complete the operation on a single block of data: one for each 8-bit XOR operation on the state and one cycle for setup. In COLUMN mode, a 32-bit XOR operation is performed on one column of the state in each cycle, and a total of 5 cycles are required. Finally, in STATE mode, the entire AES state is processed at once: a single 128-bit XOR operation is performed on the state, requiring only one cycle to complete.
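The three granularities can be modeled in software to make the cycle counts concrete. The sketch below is a behavioral model only; the function name `add_round_key` and the `WIDTH` table are hypothetical, and the cycle accounting assumes, as the totals stated above suggest, that one setup cycle is charged in BYTE and COLUMN modes but not in STATE mode.

```python
# Behavioral model of AddRoundKey at three datapath widths (illustrative only).
WIDTH = {"BYTE": 1, "COLUMN": 4, "STATE": 16}  # bytes XORed per cycle


def add_round_key(state, round_key, mode):
    """XOR a 16-byte state with a 16-byte round key.

    Returns (new_state, modeled_cycles).
    """
    if len(state) != 16 or len(round_key) != 16:
        raise ValueError("state and round key must both be 16 bytes")
    width = WIDTH[mode]
    out = bytearray(16)
    # Assumption: setup cycle counted for BYTE and COLUMN modes only.
    cycles = 0 if mode == "STATE" else 1
    for start in range(0, 16, width):
        for i in range(start, start + width):
            out[i] = state[i] ^ round_key[i]
        cycles += 1  # one datapath-wide XOR per cycle
    return bytes(out), cycles
```

All three modes produce the same result; only the modeled cycle count differs (17, 5, and 1), which is the area-versus-latency tradeoff the module exposes.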
C. SubBytes Module:
The SubBytes operation is performed as a simple byte-level substitution. All 16 bytes of the state are substituted based on a fixed look-up table. This substitution is reversible: applying the inverse table restores the original byte, so that InvSubBytes(SubBytes(x)) = x. The SubBytes module performs this lookup-table-based replacement of each byte in the state. The operation requires two additional cycles to address and access the block RAM memories on the FPGA for the byte lookups. The SubBytes module, like the AddRoundKey module, supports the BYTE, COLUMN, and STATE modes of operation, each providing the same granularity of parallelization.
In this module, the additional cycles of latency associated with the memory access are hidden in both COLUMN and STATE modes of operation by overlapping the memory access with the substitution of the state bytes once the memory output is available. This module requires 18, 6, and 3 cycles to complete when operating in BYTE, COLUMN, and STATE modes, respectively.
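As a software illustration of the lookup table and its reversibility, the following Python sketch derives the standard AES S-box from its algebraic definition and builds the inverse table from it. The names `gf_mul`, `SBOX`, `INV_SBOX`, `sub_bytes`, and `inv_sub_bytes` are hypothetical; on the FPGA the table would simply be stored in block RAM rather than computed.

```python
# Illustrative software model of the SubBytes lookup table and its inverse.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing modulo 0x11b."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """S-box entry = affine transform of the multiplicative inverse."""
    table = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF  # rotate left by k
        table.append(s ^ 0x63)
    return table


SBOX = build_sbox()

# The inverse table maps each S-box output back to its input.
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x


def sub_bytes(state):
    """Substitute every byte of a 16-byte state through the S-box."""
    return bytes(SBOX[b] for b in state)


def inv_sub_bytes(state):
    """Undo sub_bytes using the inverse table."""
    return bytes(INV_SBOX[b] for b in state)
```

Note that the substitution is undone by the inverse table, not by applying the forward table a second time: the forward S-box is a permutation of the 256 byte values but is not its own inverse.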
D. ShiftRows:
This operation is a simple cyclic shift within each row of the state: row r is rotated by r byte positions (to the left for encryption, per FIPS-197), and bytes shifted out of one side of the state re-enter on the other side of the same row. The operation is implemented through the routing interconnect, by connecting the output bits of one operation to the appropriate (shifted) input bits of the next operation. This technique does not require any additional hardware, and as the next operation may not begin until all bits have been shifted, there is no tradeoff to be had in parallelization. In fact, serializing the shift would decrease performance with no hardware reduction.
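Because ShiftRows is a fixed permutation of byte positions, it can be expressed as a constant wiring table, which is the software analogue of implementing it purely in routing. The sketch below assumes the standard FIPS-197 encryption direction and the column-major state layout (byte index = r + 4c); the names `SHIFT_ROWS_WIRING` and `shift_rows` are hypothetical.

```python
# ShiftRows as a constant wiring table (illustrative sketch).
# Output byte at (row r, column c) is taken from input (row r, column (c + r) mod 4).
SHIFT_ROWS_WIRING = [r + 4 * ((c + r) % 4) for c in range(4) for r in range(4)]


def shift_rows(state):
    """Apply ShiftRows to a 16-byte, column-major AES state."""
    if len(state) != 16:
        raise ValueError("state must be 16 bytes")
    return bytes(state[src] for src in SHIFT_ROWS_WIRING)
```

No arithmetic is performed: each output byte is a direct copy of one input byte, so in hardware the operation costs only wires.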
E. MixColumns Module:
The MixColumns operation is the most computationally intensive of the basic AES operations. Within each column of the AES state, each byte is replaced by a combination of affine transformations of all bytes in the same column. Each operation of this transformation is performed conditionally modulo 0x11b. Each byte of a column, B[i], is transformed using equation (1), shown here in the standard FIPS-197 form with indices taken modulo 4:
$$B'[i] = 2 \cdot B[i] \oplus 3 \cdot B[i+1] \oplus B[i+2] \oplus B[i+3] \qquad (1)$$
In our design, the modular addition is performed using a simple XOR operation, and the modular multiplication is implemented as the addition of partial products, each of which is reduced conditionally modulo 0x11b.
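The XOR-based addition and the conditional reduction modulo 0x11b can be illustrated with a short software model of MixColumns for a single column. This is a sketch of the standard FIPS-197 transformation, not the hardware datapath; the names `xtime` and `mix_column` are hypothetical.

```python
# Software sketch of MixColumns on one 4-byte column (standard FIPS-197 form).

def xtime(b):
    """Multiply a byte by 2 in GF(2^8): shift, then reduce if bit 8 is set."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # conditional reduction modulo 0x11b
    return b


def mix_column(col):
    """Return the MixColumns image of a 4-byte column."""
    if len(col) != 4:
        raise ValueError("a column has exactly 4 bytes")
    out = []
    for i in range(4):
        a0, a1, a2, a3 = (col[(i + k) % 4] for k in range(4))
        # 2*a0 XOR 3*a1 XOR a2 XOR a3, where 3*a1 = 2*a1 XOR a1
        out.append(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3)
    return out
```

The multiples 2·B and 3·B are the "appropriate multiples" that a hardware implementation can precompute once per byte and reuse for every output byte of the column.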
The MixColumns module allows for the same modes of operation as AddRoundKey and SubBytes: BYTE, COLUMN, and STATE. This MixColumns implementation requires an additional cycle to generate the appropriate multiples of each byte in the column; this additional cycle of latency is hidden in both COLUMN and STATE modes by overlapping the generation of these multiples with calculations from previous operations.
Figure 3 notes: For AES-256, an additional operation is applied to every 4th expanded key that is not a multiple of Nk. Nw = (Nr+1)Nb, where Nr is the number of rounds.
V. VORTEX DATA ROUTER
Vortex [15] is a reconfigurable and highly programmable Network-on-Chip (NoC) platform. It is based on the NoC paradigm; its main distinguishing attributes are the optimized network interfaces that ease the mapping of dataflow graphs onto Systems-on-Chip (SoCs). The network interfaces provide a transport layer on top of a packet-switched NoC to support frame-level transactions while abstracting the underlying physical interconnection. Vortex provides two types of network interfaces. Regardless of the type of attached network interface, Vortex uses a 16-bit device address, the device-id, to refer to an interface attached to one of its ports. A major benefit of the Vortex platform is the integrated network awareness and support for application flows. A flow describes any sequence of operations required to complete a designated computation. Flows offer three major benefits. First, a large sequential computational process can be decomposed into multiple small operators, where each operator is a general-purpose and reusable component. Second, by overlapping the computation of data with the transport of that data between endpoints (i.e., memory-to-memory transfer), the potential to hide computational latency with communication latency increases. Finally, the dataflow representation of a computation can be easily mapped to the network architecture, making design automation tractable. In addition, the support of non-trivial flow patterns, including converging and diverging flows, significantly extends the applicability of dataflow processing in SoCs.
VI. PERFORMANCE OPTIMIZATION CONTROLLER
Traditional FPGA cryptographic engine designs are undertaken from the viewpoint of static constraints, where the goal is to achieve the maximum throughput under a fixed power constraint, or alternatively the minimum power usage for a fixed throughput requirement. This work takes a dynamic and reconfigurable approach, reflecting the desired capability to adapt to changing requirements in the operational environment. For example, we can prioritize modules with better power efficiency when the battery of a mobile sensor is running low. Alternatively, the device can switch to a higher-throughput configuration when multiple sensors are generating data to be stored on a hard drive, rather than having the cryptographic throughput requirement bounded by a slow communications link.
A. GRAPHICAL USER INTERFACE FOR PERFORMANCE CONTROL
We have designed a graphical user interface (GUI) which allows the user to set priorities for the different performance outputs, and to see the capabilities of the device and the results of the optimization. We consider the two performance characteristics of throughput and latency for both encryption and decryption using AES, as well as the overall power consumption.
For each of the performance characteristics under examination, a user must choose between two kinds of optimization constraint. First, the user must decide whether a strict performance requirement must be met for the characteristic in consideration. For example, a user may require that the device obtain a total encryption throughput of 500 Mbytes/sec. Second, for the characteristics that do not have strict performance requirements, the user can set a desired relative priority for that characteristic. In the above example of fixed throughput, the user might then be able to see the tradeoff between decreasing the total power and decreasing the latency of the computation. The configuration choices available to a user are three performance parameters that can be traded off against one another:
• Total Power Consumption
• Encryption/Decryption Throughput
• Encryption/Decryption Latency
A user will set the requirements and relative priorities, and can then run the optimization routine. If requirements which are not all simultaneously achievable have been selected, the GUI will alert the user of this fact. Additionally, the GUI displays a slider bar, illustrating where in the total possible range of performance outputs the selected configuration lies. At this point in time, encryption and decryption are treated symmetrically, i.e., the same core used for encryption will be chosen for decryption, with a proportion available to be chosen by the user.
B. OPTIMIZATION PROCEDURE
The optimization procedure consists of choosing which core types, and how many of each, to instantiate on the FPGA hardware. The selection of different performance parameter requirements yields various optimization objective and constraint functions, but each takes the form of a discrete optimization. For example, consider the general optimization problem
$$\max_{n_i \in \mathbb{Z}_{\ge 0}} \sum_{i=1}^{I} n_i T_i \quad \text{s.t.} \quad \sum_{i=1}^{I} n_i W_i < W$$
where i is an index running over the total set of I core instantiations, each core i has a throughput of T_i and uses FPGA resources W_i, W is the total amount of that resource available, and the n_i are constrained to be non-negative integers. This is an instance of the classic "Knapsack Problem", specifically the unbounded knapsack problem (UKP) variant, which is NP-hard, so that exact solution is in general believed to require exponential computational complexity. However, it is possible to compute approximate solutions to any required precision via a polynomial-time dynamic programming algorithm, and if all of the weights W_i are integers, the dynamic programming solution is exact and requires O(IW) time. For the specific optimization problems solved to choose how many instantiations of each core to place on an FPGA, a few additional considerations must be taken into account. First, there exist three kinds of chip resources, each of which limits the number of cores that can be used: slice registers, slice LUTs, and block RAMs. This adds one constraint per resource type k to the optimization, of the form
$$\sum_{i=1}^{I} n_i W_i^{(k)} < W^{(k)}$$
This variant is called the multi-dimensional knapsack problem, but it can still be solved in reasonable time using dynamic programming methods.
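The single-resource case can be sketched as a textbook unbounded-knapsack dynamic program running in O(IW) time. The code below is an illustration under stated assumptions, not the controller's actual routine: the function name `unbounded_knapsack` and the example numbers are hypothetical, resource costs are assumed to be positive integers, and the budget is treated as inclusive (total cost may equal the budget).

```python
# Textbook unbounded-knapsack DP (illustrative sketch, single resource type).

def unbounded_knapsack(throughput, cost, budget):
    """Choose how many copies of each core type to instantiate.

    throughput[i]: throughput contributed by one instance of core type i
    cost[i]:       positive integer resource cost of one instance
    budget:        integer resource budget (assumed inclusive)
    Returns (best_total_throughput, counts_per_core_type).
    """
    if any(c <= 0 for c in cost):
        raise ValueError("resource costs must be positive integers")
    best = [0] * (budget + 1)     # best[w] = max throughput using at most w
    choice = [-1] * (budget + 1)  # core added last at budget w, or -1
    for w in range(1, budget + 1):
        best[w] = best[w - 1]     # option: leave one resource unit unused
        for i, (t, c) in enumerate(zip(throughput, cost)):
            if c <= w and best[w - c] + t > best[w]:
                best[w] = best[w - c] + t
                choice[w] = i
    # Walk back through the choices to recover the instance counts.
    counts = [0] * len(throughput)
    w = budget
    while w > 0:
        i = choice[w]
        if i < 0:
            w -= 1
        else:
            counts[i] += 1
            w -= cost[i]
    return best[budget], counts
```

With hypothetical core types of throughput 10, 40, and 30 and costs 5, 4, and 6 under a budget of 10, the routine selects two instances of the second core type. Extending the table to one dimension per resource (slice registers, slice LUTs, block RAMs) gives the multi-dimensional variant at correspondingly higher cost.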
One additional constraint on the optimization problem concerns the total number of cores which can be instantiated on the FPGA simultaneously. We have chosen to use the Vortex router as a solution to allow for run-time reconfiguration. A single Vortex allows a maximum of eight ports, one of which is required for off-chip communication, so that with one Vortex, a total of seven cores can be instantiated. Placing a second and each additional Vortex on the FPGA will allow six more cores to be added (since each Vortex needs to reserve ports to communicate with the other Vortex routers already on the device). This constraint is also incorporated into the dynamic programming routine.
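The port arithmetic stated above (seven cores with one router, six more for each additional router) can be captured in two small helper functions. The names `max_cores` and `routers_needed` are hypothetical, and the sketch assumes only the counts given in the text, without modeling the router topology itself.

```python
# Port-budget arithmetic for Vortex routers (illustrative sketch).
CORES_FIRST_ROUTER = 7       # eight ports, one reserved for off-chip traffic
CORES_PER_EXTRA_ROUTER = 6   # each additional router adds six core ports


def max_cores(num_routers):
    """Maximum number of AES cores attachable with the given router count."""
    if num_routers < 1:
        return 0
    return CORES_FIRST_ROUTER + CORES_PER_EXTRA_ROUTER * (num_routers - 1)


def routers_needed(num_cores):
    """Minimum number of routers needed to attach num_cores cores."""
    if num_cores < 0:
        raise ValueError("number of cores cannot be negative")
    if num_cores == 0:
        return 0
    if num_cores <= CORES_FIRST_ROUTER:
        return 1
    extra = num_cores - CORES_FIRST_ROUTER
    # Ceiling division without floating point.
    return 1 + -(-extra // CORES_PER_EXTRA_ROUTER)
```

In an optimizer, `routers_needed` would also contribute the routers' own resource usage to the constraints, since each extra router consumes FPGA area.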
To deal with optimization over different latency values, we simply restrict the set of cores eligible for instantiation to those whose latencies are less than the maximum allowable value. To perform a tradeoff between different possible latency priorities, we solve the knapsack problem multiple times for a selection of different latencies, and then choose the appropriate weighted solution. A similar method is used to select the tradeoff between power and throughput.
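The latency-restricted sweep can be sketched as follows. This is a minimal illustration, not the controller's implementation: the core tuples, the function names, and in particular the scoring rule used to pick the "weighted solution" are assumptions, since the exact weighting is not specified here. Resource costs are assumed to be positive integers.

```python
# Latency sweep over repeated knapsack solves (illustrative sketch).
# Each core is a tuple: (throughput, integer_resource_cost, latency).

def best_throughput(cores, budget):
    """Unbounded-knapsack DP: max total throughput within the budget."""
    best = [0] * (budget + 1)
    for w in range(1, budget + 1):
        best[w] = best[w - 1]
        for throughput, cost, _latency in cores:
            if cost <= w:
                best[w] = max(best[w], best[w - cost] + throughput)
    return best[budget]


def latency_sweep(cores, budget, latency_limits):
    """Solve the knapsack once per latency limit, using only eligible cores."""
    results = {}
    for limit in latency_limits:
        eligible = [core for core in cores if core[2] < limit]
        results[limit] = best_throughput(eligible, budget)
    return results


def pick_limit(results, throughput_weight, latency_weight):
    """Pick a latency limit by a hypothetical normalized weighted score."""
    max_throughput = max(results.values()) or 1
    max_limit = max(results) or 1

    def score(limit):
        return (throughput_weight * results[limit] / max_throughput
                - latency_weight * limit / max_limit)

    return max(results, key=score)
```

With hypothetical cores `(10, 5, 20)`, `(40, 4, 60)`, and `(30, 6, 30)` and a budget of 10, tightening the latency limit removes the fastest-throughput core from consideration and lowers the achievable throughput, which is the tradeoff the user-set priorities arbitrate.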
VII. TESTBED AND WORKFLOW SIMULATION
Our testbed environment includes various implementations of the AES engine as well as the Vortex router, hosted on a Dini Group DNPCIe_10G_K7_LL_QSFP FPGA development board. This board, designed to minimize Ethernet packet latency (10-Gbit or 40-Gbit), features the Xilinx Kintex-7 FFG676 FPGA. Compared to previous Xilinx FPGA families, the Kintex-7 offers reduced power consumption (50% less power consumed). Other features of the Kintex-7 family include DSP slices with 25x18 multipliers, 36 Kb dual-port block RAM, and clock management tiles (CMTs) for high precision. The Kintex-7 FPGA is attached (via PCIe) to a Windows 7 64-bit machine running an Intel® Xeon™ 3.20 GHz dual-core processor. Behavioral simulations of our AES core design were performed using the Xilinx ISE Design Suite 13.4. Power consumption was estimated using XPower Analyzer. Power estimates depend on signal switching activity, clock frequencies, timing constraints, and the device environment. Preliminary results are shown in the next section.
VIII. PERFORMANCE VALIDATION AND TRADEOFF ANALYSIS
To evaluate our AES engine, we use the Xilinx ISE Design Suite 13.4. Two configurations have been implemented in this paper: the single-stage pipeline and the full-stage pipeline. In the single-stage pipeline configuration, a single round of the AES algorithm has been implemented in hardware. This single-stage FPGA-based AES implementation is composed of the AddRoundKey, MixColumns, ShiftRows, and SubBytes functions. The single-stage pipeline configuration allows one block of data to be processed at a time. In the full-stage pipeline configuration, eleven single-stage AES rounds have been implemented, which enables the processing of multiple blocks of data simultaneously. In this configuration, the AES engine achieved higher throughput and lower latency than in the single-stage pipeline configuration.
We evaluated our proposed FPGA-based AES engine using the single-stage and full-stage pipeline configurations. A number of analyses were performed to measure latency, throughput, and resources (slice registers, block RAM, LUTs, and power consumption) for both AES pipeline configurations. Figure 4 illustrates the simulation results for the single-stage pipeline configuration. In our AES performance analysis, three modes of operation (STATE, COLUMN, and BYTE) were evaluated to measure latency, throughput, power consumption, and resources. Figure 4(a) shows the latency achieved using the STATE, COLUMN, and BYTE configurations. As shown in Figure 4(a), there is a 115% increase in latency in the BYTE configuration compared to the STATE configuration: in the STATE configuration, the entire block of data is processed at once, whereas the BYTE configuration processes a single data byte at a time. In terms of resources, the STATE implementation uses 6% more slice registers (see Figure 4(b)) and 22.8% more slice LUTs (see Figure 4(e)) than the BYTE implementation. Meanwhile, block RAM usage (see Figure 4(c)) is relatively constant in both configurations. For throughput (see Figure 4(d)) and power consumption (see Figure 4(f)), there is a 113% increase in throughput and a 4.15% increase in total power consumption. Figure 4 presents the simulation results for the full-stage pipeline implementation. Similar to the single-stage AES performance analysis, we evaluated the performance of the full-stage pipeline configuration using various metrics (latency, throughput, power consumption, and resources). As shown in Figure 4(a), there is a 107% increase in latency in the BYTE configuration compared to the full-stage pipeline. For the full-stage resource evaluation, there is a 10% increase in slice registers and a 3.4% increase in LUTs in the STATE configuration compared to the BYTE configuration. Meanwhile, block RAM usage is relatively constant for both configurations.
For throughput and power consumption, there is a 111% increase in throughput and a 16% increase in total power consumption.
IX. RELATED RESEARCH
A number of iterative AES ASIC designs with varying data path widths are proposed in [9]. The designs are based on an efficient SubBytes architecture and include operation for 128-bit keys. Round keys are generated on the fly, either by sharing SubBytes units with the main data path or by dedicating separate SubBytes units to the key expansion. The smallest version is a 32-bit AES architecture with four shared SubBytes units. Furthermore, various AES implementations using ASICs or FPGAs have been reported. Some focus on small chip area using the rolling architecture; their encryption throughputs were approximately between 1 and 1.4 Gbps, and the memory size for the core is just 32 Kbits. Other AES implementations [5], [6], [7], [8] focus on the unrolling architecture, in which the round blocks are pipelined and the inserted pipeline registers allow simultaneous operation of Nr rounds.
X. CONCLUSION
In this paper, we propose a dynamically run-time reconfigurable, power-aware cryptographic core for secure autonomous communication in small, power-limited mobile devices. Our cryptographic architecture comprises several components: an AES engine, a multi-level resource optimization controller, and the Vortex data router [15]. Data encryption and decryption are handled by the AES engine, which is capable of supporting multiple levels of security. Using the Vortex data router, we conceived our cryptographic coprocessor as a set of distributed AES cores interconnected through the robust NoC communication infrastructure.
We evaluated our proposed FPGA-based AES engine using the single-stage and the full-stage pipeline configurations. A number of analyses were performed to measure latency, throughput, and resources (Slice register,
