A high-speed and secure dynamic partial reconfiguration (DPR) system is realized with AES-GCM that guarantees both confidentiality and authenticity of FPGA bitstreams. In DPR systems, bitstream authentication is essential for avoiding fatal damage caused by unintended bitstreams. An encryption-only system can prevent bitstream cloning and reverse engineering, but cannot prevent erroneous or malicious bitstreams from being configured. Authenticated encryption is a relatively new concept that provides both message encryption and authentication, and AES-GCM is one of the latest authenticated encryption algorithms suitable for hardware implementation. We implemented the AES-GCM-based DPR system targeting the Virtex-5 device on an off-the-shelf board, and evaluated its throughput and hardware resource utilization. For comparison, we also implemented AES-CBC and SHA-256 modules on the same device. The experimental results showed that the AES-GCM-based system achieved higher throughput with less resource utilization than the AES/SHA-based system. The AES-GCM module achieved more than 1 Gbps throughput and the entire system achieved about 800 Mbps throughput with reasonable resource utilization. This paper clarifies the advantage of using AES-GCM for protecting DPR systems.
INTRODUCTION
Some recent Field-Programmable Gate Arrays (FPGAs) support dynamic partial reconfiguration (DPR), where a portion of the circuit is replaced with another module while the rest of the circuit remains fully operational. By using DPR, the functionality of the system is reactively altered by replacing a hardware module according to, for example, a user request, performance requirement, or environmental change. The flexibility of DPR is expected to make hardware systems multifunctional, cost efficient and area efficient. DPR also achieves short configuration time and consequently makes reconfigurable computing more practical and operational. The application of DPR is studied in the fields of content distribution [1], image processing [2], automotive systems [3], fault-tolerant systems [4], and software defined radio [5], among others.
In a system where intellectual property (IP) cores are replaced using DPR, the security of the IP core bitstreams is of primary concern. To guarantee confidentiality of proprietary IP cores, bitstream protection using a cryptographic algorithm such as Advanced Encryption Standard (AES) [6] is quite effective and is widely applied in practical systems. The encryption prevents bitstream cloning and reverse engineering. Several FPGA families have an embedded decryptor and can be configured from an encrypted bitstream. However, such an embedded decryptor is available only for the entire configuration and not for DPR.
In addition to bitstream encryption, bitstream authentication is essential for the protection of DPR systems [7]. An encryption-only system is insufficiently secure because the system cannot prevent erroneous or malicious bitstreams from being configured. Since the hardware architecture itself is changed in DPR systems, an unauthorized bitstream can cause fatal, unrecoverable damage to the system. In an encryption-only system, a malicious bitstream will be jumbled by the decryptor into meaningless data. However, there still remains a possibility that the resulting erroneous bitstream will damage the FPGA by indiscriminately setting the internal logic, I/O, interconnect and so on. To guarantee the authenticity of a message, Secure Hash Algorithm (SHA) [8] is widely used.
Since both confidentiality and authenticity of bitstreams must be guaranteed, authenticated encryption (AE) [9] is a natural fit for DPR systems. AE is a relatively new concept in cryptographic technology, providing both message encryption and authentication. AE is expected to lead to area-efficient implementation when compared with the use of separate encryption and authentication algorithms. It will also enable high-speed implementation by eliminating the overhead of data synchronization between two separate algorithms.
To protect a DPR system against erroneous or malicious bitstreams, we implement the latest AE algorithm, Galois/Counter Mode (GCM) of operation [10, 11], with AES. GCM is based on the counter mode of operation (CTR) and uses universal hashing in the finite field $GF(2^w)$ [12]. GCM is pipelinable and parallelizable, and thus suitable for hardware implementation. As is explained in [13], other AE algorithms are not necessarily suitable for hardware implementation because they are not parallelizable or pipelinable. Additionally, other algorithms are vulnerable to bit-flipping attacks. Therefore, the use of AES-GCM is currently the best solution for protecting bitstreams with both encryption and authentication.
This paper presents the architecture, implementation results and performance evaluation of an AES-GCM-based DPR system. The system is implemented targeting Virtex-5 on an off-the-shelf board, and we verify that its mechanism of bitstream encryption and authentication works correctly. For comparison, an AES-CBC and SHA-256-based DPR system is also implemented on the same device. To compare resource utilization with past studies, both systems are also implemented on Virtex-II Pro.
The rest of this paper is organized as follows. Section 2 introduces past studies on DPR security. Section 3 explains the partial reconfiguration of Xilinx FPGAs. Section 4 briefly explains the cryptographic algorithms related to our implementation. Section 5 describes the system architecture of our AES-GCM-based and AES-CBC/SHA-256-based DPR systems. Section 6 describes the implementation results and evaluation of the developed DPR systems, and finally Section 7 summarizes this paper.
RELATED WORK
Xilinx Virtex series devices have a built-in bitstream decryptor and support configuration with an encrypted bitstream. Virtex-II and Virtex-II Pro support Triple Data Encryption Standard (Triple-DES) [14] with 56-bit keys, while Virtex-4 and Virtex-5 support AES with a 256-bit key. The secret key is stored in dedicated volatile memory inside the FPGA. Therefore, the storage must always be supplied with power through an external battery. Unfortunately, configuration with an encrypted bitstream is unavailable when using DPR: if the device is configured using the built-in bitstream decryptor, the DPR function is disabled. Therefore, in DPR systems, a partial bitstream must be decrypted with user logic.
Bossuet et al. proposed a secure configuration method in DPR systems [15]. In their system, an arbitrary cryptographic algorithm can be employed because the bitstream decryptor itself is implemented as a reconfigurable module. Their method uses bitstream encryption but does not consider its authenticity.
Zeineddini and Gaj developed a DPR system that used separate encryption/authentication algorithms for bitstream protection [16]. AES was used for bitstream encryption and SHA-1 for authentication. AES and SHA-1 were implemented as C programs and run on two types of embedded microprocessors: PowerPC and MicroBlaze. The total processing times of authentication, decryption and configuration of a 14-KB bitstream with PowerPC and MicroBlaze were approximately 400 ms and 2.3 s, respectively. This performance, however, would be insufficient for practical DPR systems.
Parelkar used AE to protect FPGA bitstreams [17] and implemented various AE algorithms: Offset CodeBook (OCB) [18], Counter with CBC-MAC (CCM) [19] and EAX [20] modes of operation with AES. To compare the performance of the AE method with a separate encryption and authentication method, SHA-1 and SHA-512 were also implemented with AES-ECB (Electronic CodeBook).
PARTIAL RECONFIGURATION OF FPGAS

Partial Reconfiguration Overview
In Xilinx FPGAs, a module to be dynamically replaced is called a Partially Reconfigurable Module (PRM), and an area where a PRM is placed is called a Partially Reconfigurable Region (PRR). A PRM can be a rectangle of arbitrary size. Figure 1 shows an example structure of the partially reconfigurable design.
The smallest unit of a bitstream that can be accessed is called a frame. In Virtex-5 devices, a frame consists of 1312 bits of configuration information corresponding to the height of 20 configurable logic blocks. A bitstream of a PRM is a collection of frames. Each device family has a different frame structure, but this paper does not focus on other devices.
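As a rough illustration of the frame granularity, the following sketch converts the 1312-bit Virtex-5 frame into bytes and 32-bit words and estimates how many whole frames fit in a payload of a given size. The function name is hypothetical, and the sketch assumes the payload consists of frame data only, ignoring the headers and configuration commands that a real partial bitstream also contains.

```python
FRAME_BITS = 1312  # Virtex-5 frame size as stated in the text


def frame_layout(payload_bytes: int) -> dict:
    """Return frame size in bytes/words and the number of whole frames in a payload.

    Assumption: the payload is pure frame data (no headers or commands).
    """
    frame_bytes = FRAME_BITS // 8    # 1312 bits = 164 bytes
    frame_words = FRAME_BITS // 32   # 1312 bits = 41 32-bit words
    return {
        "frame_bytes": frame_bytes,
        "frame_words": frame_words,
        "whole_frames": payload_bytes // frame_bytes,
    }


if __name__ == "__main__":
    # An 11 KB payload holds at most this many whole frames under the assumption above.
    print(frame_layout(11 * 1024))
```

Because a frame is an integer number of 32-bit words, frame data aligns with a 32-bit configuration port without padding.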
Bus Macro
All signals between a PRM and a fixed module must pass through bus macros to lock the wiring. In Virtex-5 devices, the bus macro is a 4-bit-wide pre-routed macro composed of four 6-input Lookup Tables (LUTs). The bus macro must be placed inside a PRM. The bus macros of the older device families are 8-bit-wide pre-routed macros composed of sixteen 4-input LUTs, and placed on the PRM boundary.
Internal Configuration Access Port
Virtex-II and newer Virtex series devices support self DPR with the Internal Configuration Access Port (ICAP). ICAP basically works in the same manner as the SelectMAP configuration interface. Since user logic can access configuration memory through ICAP, partial reconfiguration of an FPGA can be controlled by internal user logic. In Virtex-5 devices, the data width of ICAP can be selected from 8, 16 and 32 bits.
CRYPTOGRAPHIC ALGORITHM
Advanced Encryption Standard
AES is a symmetric key block cipher algorithm standardized by the U.S. National Institute of Standards and Technology (NIST) [6]. While the previous DES [21] has a Feistel network architecture, AES employs a substitution-permutation network (SPN) architecture. The block length of AES is 128 bits, and the key length is selected from 128, 192 and 256 bits.
Galois/Counter Mode of Operation
A block cipher algorithm can be used in various modes of operation. GCM [10] is one of the latest modes of operation standardized by NIST [11]. Figure 2 shows an example operation of GCM.
The encryption and decryption scheme of GCM is based on the CTR mode of operation [22]. Thus, GCM can be highly parallelized and pipelined and is therefore suitable for hardware implementation, achieving a wide variety of performances from compact to high speed [23, 24]. Some other AE algorithms are not necessarily suitable for hardware implementation because they cannot be parallelized or pipelined [13].
AES-GCM is one of the AE algorithms providing both message confidentiality and authenticity. GCM uses universal hashing in the finite field $GF(2^w)$ for generating a message authentication code (MAC). An additional merit of using $GF(2^w)$ is that the computation cost of multiplication in $GF(2^w)$ is lower than that of integer multiplication, because no carries are propagated. AES-GCM provides high security and is suitable for hardware implementation. Therefore, the use of AES-GCM is the best solution for protecting FPGA bitstreams in DPR systems.
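The carry-free nature of the field multiplication can be seen in a short software sketch. The code below assumes w = 128 and the bit ordering and reduction constant of the NIST GCM specification; it is an illustrative bit-serial model, not the hardware multiplier used in this system, and the names `gf128_mul` and `ghash` are our own.

```python
from typing import Iterable

R = 0xE1 << 120   # reduction constant 11100001 || 0^120 from the GCM specification
ONE = 1 << 127    # multiplicative identity (the leftmost bit is the degree-0 coefficient)


def gf128_mul(x: int, y: int) -> int:
    """Multiply two GF(2^128) elements held as 128-bit integers (GCM bit order).

    Only XOR and shift are used: there is no carry chain as in integer multiplication.
    """
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:   # scan the bits of x from leftmost to rightmost
            z ^= v
        # multiply v by the field generator, reducing when a bit falls off the right end
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash(h: int, blocks: Iterable[int]) -> int:
    """Universal hash over 128-bit blocks: Y_i = (Y_{i-1} XOR X_i) * H."""
    y = 0
    for block in blocks:
        y = gf128_mul(y ^ block, h)
    return y
```

In hardware, the 128 loop iterations can be unrolled into XOR networks, which is one reason the hash can keep pace with a pipelined AES core.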
Secure Hash Algorithm
SHA is widely used to guarantee the authenticity of a message. SHA is a family of cryptographic hash functions that generate a message digest of a particular length. Currently, five algorithms, namely, SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512, are defined; except for SHA-1, whose output length is 160 bits, each name denotes the length of the output message digest. The latter four algorithms are collectively referred to as SHA-2. Since SHA-1 has been reported to have a security vulnerability [25], SHA-2 should be used instead for message authentication.
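The digest lengths can be checked with Python's standard `hashlib` module. The sketch below also shows a digest comparison of the kind a software verifier might perform; the helper names are hypothetical, and the check is only meaningful if the reference digest itself comes from a trusted source.

```python
import hashlib
import hmac


def digest_bits(name: str) -> int:
    """Output length in bits of a hash algorithm known to hashlib."""
    return hashlib.new(name).digest_size * 8


def digest_matches(data: bytes, expected: bytes) -> bool:
    """Compare the SHA-256 digest of data with a trusted reference digest.

    hmac.compare_digest performs a constant-time comparison.
    """
    return hmac.compare_digest(hashlib.sha256(data).digest(), expected)


if __name__ == "__main__":
    for name in ("sha1", "sha224", "sha256", "sha384", "sha512"):
        print(name, digest_bits(name))
```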
SYSTEM ARCHITECTURE
DPR System with AES-GCM
In DPR systems, both the confidentiality and the authenticity of PRM bitstreams should be guaranteed. As mentioned in Section 4.2, AES-GCM is one of the most promising algorithms to achieve this purpose. Figure 3 shows a block diagram of the DPR system with bitstream encryption and authentication using AES-GCM. In the system, the lengths of the AES key and the initialization vector are set to 128 bits and 96 bits, respectively. The downloaded bitstream of a PRM is decrypted by the AES-GCM module and its authenticity is simultaneously verified. The secret key of AES is embedded in the system. The decrypted bitstream is stored in the 128 × 2048-bit (32 KB) internal memory (Block RAM). A bitstream is checked for its authenticity every 32 KB, so a large bitstream is divided into several 32 KB blocks. Reconfiguration of the PRM starts after the bitstream is authenticated by AES-GCM. Authentication and reconfiguration cannot be parallelized or processed in a fine-grained pipeline because the bitstream cannot be input to ICAP before its authenticity is verified.
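The authenticate-then-configure flow can be sketched in software as follows. This is a behavioral model only: `decrypt_and_verify` and `write_icap` are hypothetical callbacks standing in for the AES-GCM module and the ICAP controller, and the sketch assumes that each 32 KB block is verified independently and that loading stops at the first block that fails verification.

```python
from typing import Callable, Iterator, Optional

BLOCK_BYTES = 128 * 2048 // 8  # 128 x 2048-bit Block RAM buffer = 32 KB


def split_blocks(bitstream: bytes, block_bytes: int = BLOCK_BYTES) -> Iterator[bytes]:
    """Yield consecutive blocks of at most block_bytes bytes."""
    for offset in range(0, len(bitstream), block_bytes):
        yield bitstream[offset:offset + block_bytes]


def load_prm(
    bitstream: bytes,
    decrypt_and_verify: Callable[[bytes], Optional[bytes]],
    write_icap: Callable[[bytes], None],
) -> int:
    """Configure block by block and return the number of blocks written.

    decrypt_and_verify (hypothetical) returns the plaintext, or None if the
    authentication check fails. No data reaches write_icap before it is verified.
    """
    written = 0
    for block in split_blocks(bitstream):
        plaintext = decrypt_and_verify(block)
        if plaintext is None:
            break  # reject unauthenticated data before it reaches the ICAP
        write_icap(plaintext)
        written += 1
    return written
```

With this structure, the 11 KB PRM used in the experiments fits in a single block, so its reconfiguration begins only after the whole bitstream has been authenticated.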
AES-GCM
GCM is based on the CTR mode and uses universal hashing in the finite field $GF(2^w)$. The S-box of AES is implemented as a table using Block RAM. In AES-GCM, a 128-bit block is decrypted in 12 clock cycles. The last block of the message requires these 12 cycles plus an additional 10 clock cycles to calculate the authentication tag.
Suppose that the size of the bitstream is N [byte] and the clock frequency of the system is f [MHz]. When N is sufficiently large, the additional 10 cycles for the tag calculation can be safely ignored. Thus the maximum throughput of the AES-GCM module $P_{gcm}$ is
\[ P_{gcm} = \frac{128}{12} f \ \text{[Mbps]} \]
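To see how little the 10 tag cycles matter, the cycle counts above can be evaluated directly. The sketch below is our own arithmetic based on the stated figures (12 cycles per 128-bit block, 10 extra cycles for the tag); the function names are hypothetical.

```python
def gcm_cycles(n_bytes: int, cycles_per_block: int = 12, tag_cycles: int = 10) -> int:
    """Clock cycles to process n_bytes, counting 128-bit blocks plus the tag."""
    blocks = -(-n_bytes // 16)  # ceiling division: 16 bytes per block
    return blocks * cycles_per_block + tag_cycles


def gcm_throughput_mbps(n_bytes: int, f_mhz: float) -> float:
    """Throughput in Mbps: bits divided by processing time in microseconds."""
    return 8 * n_bytes * f_mhz / gcm_cycles(n_bytes)


def gcm_peak_mbps(f_mhz: float, cycles_per_block: int = 12) -> float:
    """Asymptotic throughput when the tag cycles are ignored."""
    return 128 * f_mhz / cycles_per_block


if __name__ == "__main__":
    # One 32 KB block at 100 MHz, compared with the asymptotic value.
    print(gcm_throughput_mbps(32 * 1024, 100.0), gcm_peak_mbps(100.0))
```

At 100 MHz the asymptotic value is about 1067 Mbps, and for a full 32 KB block the tag overhead reduces it by well under 0.1%.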
Reconfiguration of PRM
Unlike other DPR systems, our system does not use an embedded processor to control partial reconfiguration. The input data and control signals of ICAP are directly connected to and controlled by the user logic. Thus, our system is free from the delay of processor buses. In the system, the width of the ICAP data port is set to 32 bits. When the frequency of input data to ICAP is f [MHz], the maximum throughput of the reconfiguration process $P_{icap}$ is
\[ P_{icap} = 32 f \ \text{[Mbps]} \]
In Virtex-5, the maximum frequency of the ICAP is limited to 100 MHz; thus the ideal throughput of the reconfiguration process is 3,200 Mbps.
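The ideal ICAP figures follow from simple arithmetic, sketched below under the assumption that one 32-bit word is written on every clock cycle with no stalls. The function names are hypothetical.

```python
def icap_throughput_mbps(f_mhz: float, width_bits: int = 32) -> float:
    """Ideal ICAP throughput: one word of width_bits per clock cycle."""
    return width_bits * f_mhz


def icap_time_us(n_bytes: int, f_mhz: float, width_bits: int = 32) -> float:
    """Ideal time in microseconds to write n_bytes through the ICAP."""
    return 8 * n_bytes / icap_throughput_mbps(f_mhz, width_bits)


if __name__ == "__main__":
    print(icap_throughput_mbps(100.0))     # ideal rate at the 100 MHz limit
    print(icap_time_us(32 * 1024, 100.0))  # one full 32 KB buffer
```

Under these assumptions, writing one full 32 KB buffer takes about 82 μs at 100 MHz, so the ICAP is roughly three times faster than the AES-GCM module feeding it.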
As the main purpose of this study is to clarify the feasibility of AES-GCM for bitstream encryption and authentication, rather simple function blocks, e.g., a 28-bit adder and a 28-bit subtractor, are used as PRMs. The PRM is connected to the static modules with two bus macros. The four most significant bits of the adder or the subtractor are output from the PRM and connected to LEDs on the board. The PRR contains 80 slices, 640 LUTs and 320 registers. The size of the PRM bitstream is about 11 KB.
DPR System with AES-CBC + SHA-256
To compare the performance of the AE method and the separate encryption/authentication method, we also implemented the AES-CBC and SHA-256 modules for bitstream encryption and authentication. Figure 4 shows a block diagram of the DPR system with AES-CBC and SHA-256. Bitstream decryption using AES-CBC and authentication using SHA-256 are processed in parallel. As in the AES-GCM system, the decrypted bitstream is stored in a 128 × 2048-bit Block RAM. Reconfiguration of the PRM starts after the authenticity of the bitstream is verified.
AES-CBC
Our system uses AES-CBC, one of the most widely used confidentiality modes. The simplest mode of operation (AES-ECB) is not employed because it is not sufficiently secure for practical use [22]. The CBC mode can also be used for generating a message authentication code (MAC) [26], but it is not employed for that purpose because the CBC-MAC algorithm reportedly has security deficiencies [27]. Therefore, our system employs a separate authentication algorithm, SHA-256, for the bitstream integrity check.
Similar to the AES-GCM system, the S-box is implemented as a table using Block RAM. In AES-CBC, a 128-bit block is decrypted in 11 clock cycles. When the operating frequency of the system is f [MHz], the maximum throughput of the AES-CBC module $P_{cbc}$ is
\[ P_{cbc} = \frac{128}{11} f \ \text{[Mbps]} \]
SHA-256 Module
Since SHA-1 reportedly has a security vulnerability [25], SHA-256 is selected as the authentication algorithm.
SHA-256 processing takes more clock cycles than AES; therefore, the throughput of the overall bitstream processing is restricted by SHA-256. While the SHA-256 algorithm is relatively simple and straightforward, it is difficult to parallelize or pipeline. Thus, the performance of the SHA-256 module is difficult to improve.
Reconfiguration of PRM
Reconfiguration of the PRM in the AES-CBC/SHA-256 system is performed in the same manner as that in the AES-GCM system after the authenticity of the bitstream is verified by the SHA-256 module. When the operating frequency is f 
IMPLEMENTATION
This section describes the implementation results of the AES-GCM-based system (hereinafter PR-AES-GCM) and the AES-CBC/SHA-256-based system (hereinafter PR-AES-SHA). PR-AES-GCM and PR-AES-SHA are implemented targeting Virtex-5 (XC5VLX50T-FFG1136) on an ML505 board [28], and we verified that DPR works correctly on the systems. The systems are designed using the Xilinx Early Access Partial Reconfiguration (EA PR) flow [29] and implemented with ISE 9.1.02i PR10 and PlanAhead 9.2.7 [30]. Table 1 and Table 2 show the hardware utilization of PR-AES-GCM and PR-AES-SHA implemented on Virtex-5, respectively. The item "Overall" shows the total amount of hardware resources used by all modules except the PRM. Table 1 and Table 2 also describe the hardware utilization of each module in stand-alone implementation.
Hardware Resource Utilization
The hardware architecture of Virtex-5 is vastly different from that of earlier devices such as Virtex-II Pro and Virtex-4. The slice of Virtex-5 contains four 6-input LUTs, whereas that of earlier devices contains two 4-input LUTs. Thus, the number of used slices becomes smaller in the Virtex-5 implementation. To make a fair comparison with other studies, we also implemented the systems on Virtex-II Pro (XC2VP30-FF896). The hardware utilization of PR-AES-GCM and PR-AES-SHA on Virtex-II Pro is given in Table 3.
Performance Evaluation
The clock frequencies of PR-AES-GCM and PR-AES-SHA are both 100 MHz. To enable comparison with [16], the computation time required to configure a 14,112-byte PRM is described in Table 4. In PR-AES-GCM, the overall processing time for the PRM configuration is simply 
In PR-AES-SHA, authentication and decryption are processed in parallel. Therefore, the overall processing time is max(160.97, 97.14) + 35.3 = 196.27 [μs].
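The timing model for the 14,112-byte PRM can be reproduced with a few lines of arithmetic. The sketch below is our own calculation, not the authors' tooling: the PR-AES-SHA inputs are the three times quoted above, while the PR-AES-GCM estimate assumes a decryption rate of 128f/12 Mbps and an ICAP rate of 32f Mbps at f = 100 MHz and ignores the tag cycles.

```python
N_BYTES = 14112  # PRM size used for the comparison
F_MHZ = 100.0    # clock frequency of both systems


def throughput_mbps(n_bytes: int, time_us: float) -> float:
    """Overall throughput in Mbps for n_bytes processed in time_us microseconds."""
    return 8 * n_bytes / time_us


def sha_system_time_us(auth_us: float, decrypt_us: float, reconf_us: float) -> float:
    """PR-AES-SHA: authentication and decryption overlap, then reconfiguration."""
    return max(auth_us, decrypt_us) + reconf_us


def gcm_system_time_us(n_bytes: int, f_mhz: float) -> float:
    """PR-AES-GCM estimate: decrypt-and-authenticate, then reconfigure (sequential)."""
    gcm_us = 8 * n_bytes / (128 * f_mhz / 12)   # assumes 12 cycles per 128-bit block
    icap_us = 8 * n_bytes / (32 * f_mhz)        # assumes one 32-bit word per cycle
    return gcm_us + icap_us


if __name__ == "__main__":
    t_sha = sha_system_time_us(160.97, 97.14, 35.3)
    t_gcm = gcm_system_time_us(N_BYTES, F_MHZ)
    print(t_sha, throughput_mbps(N_BYTES, t_sha))
    print(t_gcm, throughput_mbps(N_BYTES, t_gcm))
```

Under these assumptions the estimates come out at roughly 575 Mbps for PR-AES-SHA and roughly 800 Mbps for PR-AES-GCM, in line with the overall throughputs reported for the two systems.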
In the PowerPC and MicroBlaze systems, authentication, decryption and reconfiguration are performed sequentially. As such, the overall processing time is simply the sum of the individual processing times. Using equations (1) through (4), the performance of the systems is calculated, as shown in Table 4. Table 4 also gives the throughput of other AE algorithms reported in [17].
Analysis of the Results
As Tables 1 and 2 show, PR-AES-GCM utilizes fewer registers, LUTs and slices than PR-AES-SHA in the implementations on Virtex-5. The results indicate that AES-GCM is more area efficient than the separate algorithms of AES-CBC and SHA-256. When implemented on Virtex-II Pro, PR-AES-GCM utilizes fewer registers and slices than PR-AES-SHA, though it utilizes slightly more LUTs. As shown in Table 4, PR-AES-GCM achieved the highest overall throughput of about 800 Mbps with only 19% slice utilization. The AES-GCM module achieves a throughput of more than 1 Gbps, which is faster than those of the other AE methods OCB, CCM and EAX. Furthermore, PR-AES-GCM uses fewer slices than the other AE methods. Note that PR-AES-GCM includes additional modules such as a reconfiguration controller and an LED controller. The results show that high-speed and area-efficient implementation is achieved by PR-AES-GCM.
Since AES-GCM can be parallelized and pipelined, it can obtain much higher throughput by using more hardware resources. AES-GCM thus allows a very flexible range of architectures, from compact to high speed.
In the PR-AES-SHA system, the AES module achieved the highest throughput of 1164 Mbps, while the overall throughput is relatively low (575 Mbps). This is because the throughput of the SHA-256 module is relatively low (701 Mbps). Since the SHA-256 algorithm is quite straightforward and can hardly be parallelized or pipelined, improving the SHA-256 throughput is difficult. Although the various hardware architectures of AES can achieve a wide variety of performances, the SHA-256 module will restrict the overall performance of the system. This is the disadvantage of using SHA-256 for bitstream authentication.
The DPR systems based on PowerPC and MicroBlaze require an overall computation time ranging from several hundred milliseconds to several seconds. This will not be acceptable for practical DPR systems. Therefore, authentication, decryption and reconfiguration should be processed using dedicated hardware to achieve practical DPR systems. Compared with the software-based systems, our approach attained extremely high performance. PR-AES-GCM achieved 2843 times higher throughput than the PowerPC system, and 16087 times higher throughput than the MicroBlaze system.
CONCLUSIONS
We developed a secure dynamic partial reconfiguration (DPR) system with AES-GCM that guarantees both confidentiality and authenticity of FPGA bitstreams. AES-GCM is one of the latest authenticated encryption (AE) algorithms. Implemented on Virtex-5 (XC5VLX50T), AES-GCM achieved more than 1 Gbps throughput and the entire system achieved about 800 Mbps throughput, which is sufficient for practical DPR use, while utilizing less than 20% of the slices.
For comparison, we also implemented AES-CBC and SHA-256 on the same device. The implementation results show that the AES-GCM-based system achieves higher throughput and is more area efficient than the AES/SHA-based system. Although AES can achieve a wide variety of performances from compact to high speed, SHA is a straightforward algorithm that can hardly be parallelized or pipelined. Therefore, the performance of the AES/SHA-based system is restricted by the SHA module. The performance of AES-GCM is also compared with other AE algorithms. AES-GCM achieved higher throughput than other modes of operation such as OCB, CCM and EAX.
Considering the experimental results, it is concluded that the use of AES-GCM is currently one of the most promising approaches for protecting FPGA bitstreams and achieving high-speed and area-efficient DPR systems.
