Abstract: While AES is computationally secure, it is not without weakness. Side-channel attacks on AES hardware implementations can reveal its secret key. Because of this vulnerability, countermeasures to side-channel attacks are crucial to data security. This paper assesses a current design and proposes a new design for securing AES, and evaluates their implementation on field-programmable gate arrays. Both countermeasures were successfully implemented, and the data remained secure against common side-channel attacks. Results indicate successful obfuscation of the secret key relative to the original AES implementation.
Introduction
While the Advanced Encryption Standard (AES) is known to be computationally secure, it is vulnerable to side-channel attacks against hardware implementations. One of the more common hardware attacks is Differential Power Analysis (DPA) introduced by Kocher et al. [1] . DPA is performed by sending the AES algorithm known plaintext or receiving known ciphertext. During the process, power and/or electromagnetic traces from the encryption device are recorded and are correlated to the known data and key guesses. To counteract this vulnerability, DPA countermeasures attempt to create independence between the data and the power traces. The two main countermeasure types are hiding and masking [2] . Hiding attempts to cover the message with noise from the circuit or use other means such as timing variance to create independence between the data and the traces. Masking conceals the data by adding or multiplying random numbers to the data during the encryption process, and then removing the mask(s) before the output is determined. This paper focuses on two hiding techniques to obfuscate the data from an attacker. The first countermeasure focuses on preventing traces from being aligned by randomizing the clock frequency of the circuit, therefore hindering DPA attempts by an attacker. The idea was introduced by Zafar et al. [3] and the concept was simulated using simulation software such as MentorGraphics ModelSim. This research differs in that it provides a realized implementation outside of previous theoretical simulated designs. The second countermeasure focuses on a hiding technique called bit-balancing. The current bit-balancing trend for AES is dual-rail logic, which concurrently evaluates the data and the inverse of the data at the gate level [2] . This research develops and implements a countermeasure technique that provides bit-balancing at the system level, therefore greatly decreasing the complexity and size of the circuit compared to AES with dual-rail logic gates. 
The countermeasure described in this research is similar to [4] in many key areas, but differs in both implementation and design.
The two countermeasures were chosen for their potential effectiveness for obfuscation of the circuit while minimizing costs to the user. Previous obfuscation techniques, such as masking, have been shown to be effective in providing independence between the data and power traces, but typically require a 2x to 10x increase in circuit size and a 3x to 90x decrease in circuit speed [5]. The goal of this research is to provide effective defense against power attacks on AES while minimizing costs in circuit area, circuit speed, and circuit power consumption. This paper is organized as follows: an overview of the AES algorithm and common DPA techniques is given in Section 2. Section 3 provides results of DPA performed on hardware AES for the purpose of both developing a baseline for comparison and supporting the need for countermeasures. Section 4 discusses hiding countermeasures designed to misalign power traces and presents an implemented design of a randomized clock. Section 5 looks at dual-rail logic and bit-balancing countermeasures and provides an effective new system-level bit-balancing implementation, followed by conclusions in Section 6.
T. R. Andel et. al.: Design and Implementation of Hiding…
Background of AES and DPA
In order to put the remainder of the paper in context, a brief overview of the AES algorithm and DPA attacks is provided.
AES Background
The AES algorithm is a form of the Rijndael algorithm and is the current standard for symmetric-key cryptography [6]. It is a symmetric block cipher that processes 128-bit data blocks and can operate on keys with lengths of 128, 192, or 256 bits. Each data block consists of 16 bytes and is four rows deep and four columns wide. Each individual block is processed through 10, 12, or 14 rounds, depending on the key size. For this paper, when referring to the AES algorithm, we assume a key length of 128 bits using a total of 10 rounds. A pictorial overview of AES is given in Figure 1. Each round consists of four individual transformations of the data: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The algorithm begins with an initial AddRoundKey step and then continues into the rounds. The last round does not contain the MixColumns function. Of the four transformations, only the SubBytes function is nonlinear, and it is therefore of great interest for power attacks [7]. SubBytes is a table lookup from a substitution box (SBOX), which replaces each input byte with a corresponding output byte. Also, as seen in Figure 1, the key is used to create round keys derived from the key schedule. More detail on the AES algorithm can be found in the FIPS publication [6].
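The ordering of the transformations described above can be sketched as a simple schedule. This is only an illustration of the round structure; the function name `aes_round_schedule` is hypothetical and no actual encryption is performed.

```python
def aes_round_schedule(key_bits=128):
    """Return the ordered list of transformations applied to one data block."""
    rounds = {128: 10, 192: 12, 256: 14}[key_bits]
    schedule = ["AddRoundKey"]  # initial round-key addition before round 1
    for r in range(1, rounds + 1):
        schedule += ["SubBytes", "ShiftRows"]
        if r != rounds:  # the last round omits MixColumns
            schedule.append("MixColumns")
        schedule.append("AddRoundKey")
    return schedule


ops = aes_round_schedule(128)
print(len(ops), ops[-3:])
```

For the 128-bit case assumed in this paper, the schedule contains eleven AddRoundKey steps (one per round key) and nine MixColumns steps.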
DPA Background
Differential Power Analysis is a powerful attack that uses statistical analysis and error correction techniques to extract information correlated to the secret keys [1] . There are two main phases of a DPA attack: data collection and data analysis. Data collection is performed by sampling a circuit's power consumption or electromagnetic (EM) radiation during operation of the cryptographic algorithm. Data analysis is the more involved phase of DPA and often involves correlation of the collected power traces to a hypothetical power model created by the attacker. There are several models used to create hypothetical intermediate values used for correlation between the data and the power traces. The most common models are the bit model, Hamming Weight (HW) model, Hamming Distance (HD) model, and the Zero-Value (ZV) model [2] .
The bit model looks at one bit of an intermediate value at a time and separates the collected power traces into two groups. The first group contains the traces for which that bit should be 0 under the current key guess, and the second group contains the traces for which it should be 1. While straightforward, the bit model is limited in scope and is generally not as effective as the HW, HD, or ZV models [2].
The HW model counts the number of '1's in a binary number and assigns that to its value. For example, 00011000 would have a HW of two. The HW model is effective in modeling power traces because of the difference in power drawn from a circuit evaluating a '1' compared to a '0'.
The HD model is even more effective than the HW model, but requires knowledge of either the preceding or following intermediate value, as well as the current hypothetical intermediate value. The HD is the number of bit positions in which two binary numbers differ. For example, if a number were to change from 00011000 to 11111111, the HD would be six. There is usually a significant change in power when a device in a circuit changes its output value from '0' to '1' or vice versa. In this way, the HD model can be a very effective model for hypothetical power consumption.
Finally, the ZV model assumes that the power consumption for the data value 0 is lower than the power consumption for all other values [2] . The data within a ZV model is set as either a 0 or 1 depending on whether the data evaluated is equal to 0.
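The Hamming Weight, Hamming Distance, and Zero-Value models described above can be sketched in a few lines. The function names are illustrative, and the ZV mapping (0 for a zero value, 1 otherwise) is an assumption about which label is assigned to which case.

```python
def hamming_weight(x):
    """Number of '1' bits in x."""
    return bin(x).count("1")


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return hamming_weight(a ^ b)


def zero_value(x):
    """Assumed ZV mapping: 0 for the (low-power) zero value, 1 otherwise."""
    return 0 if x == 0 else 1


print(hamming_weight(0b00011000))                 # example from the text
print(hamming_distance(0b00011000, 0b11111111))   # example from the text
```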
Regardless of the chosen model, an intermediate value must be calculated in order to apply it. When evaluating only one byte at a time, it is beneficial to avoid the MixColumns function. Otherwise, more than one byte of the intermediate value (and therefore more than one byte for the key guess) would have to be assumed [6]. This approach would greatly increase the complexity of the DPA and exponentially increase the processing time. Because of this limitation, intermediate values before the first MixColumns operation or after the last MixColumns operation are attacked for DPA on AES.
DPA Results on Hardware AES
For a baseline design, a simple iterative Verilog version of AES is used [8], where each round is performed sequentially before the next byte is processed. Power traces are collected using the Riscure Inspector side-channel test suite, including an EM high sensitivity probe, an EM Probe Station, and the Riscure Inspector software. Power traces are measured using a LeCroy WavePro 725Zi oscilloscope, and the AES design is loaded onto a Xilinx Virtex-5 FX FPGA evaluation board. A visual overview of the experimental setup is given in Figure 2. The number of traces collected for each AES system tested in this research is varied between 1,000 and 3,000,000. The frequency of the bus is set to 100 MHz. To minimize unwanted noise, the traces are filtered using a bandpass filter between 90 and 210 MHz. Each trace uses the same key given in Equation 1, and the input data is randomized for every trace using Java's randomize function. An example of a power trace is given in Figure 3, with AES encryption occurring between 1.1 and 1.7 sec.
Key = 0001 0203 0405 0607 0809 0A0B 0C0D 0E0F
Last Round Key = 1311 1D7F E394 4A17 F307 A78B 4D2B 30C5 (1)
To effectively attack the AES algorithm, several different models were used on both the first and last round of the algorithm. Along with bit and ZV models, HW and HD models were used to correlate the traces to the data. During initial testing, the bit and ZV models proved less effective than the HW and HD attacks, and therefore were not considered in the final evaluation of correlation effectiveness. When evaluating the last round, the original key cannot be used; the last round key given in Equation 1 is used instead. This is due to the fact that each round adds its corresponding round key, and the round key is only equal to the original key for the first round. Subsequent rounds use keys generated during the key expansion process of AES. The correlation is performed using MATLAB software. After evaluating each modeling technique on the different intermediate values, it appears the HD on the last round between the intermediate value before the SubBytes and the ciphertext (choice 7) provides the highest correlation. The results of an attack on the first byte are given in Figure 4. The correlation for the correct key guess (19 in decimal, matching the first byte 0x13 of the last round key) is given by the solid blue line, and the attack is performed while the number of traces is varied. DPA is successfully performed when at least 500,000 traces are used, as indicated when the correlation of the correct guess exceeds the peak false positive. The attack on the unprotected AES system gives a process baseline to compare to AES implementations with added countermeasures. The goal of this research is to effectively eliminate correlations such as those seen in Figure 4 with minimal cost in added execution time and circuit area. The first countermeasure increases the challenge of aligning traces, therefore greatly increasing the difficulty of processing the power traces after collection.
The second countermeasure effectively flattens the power signatures of the intermediate values in relation to the Hamming Weight and Hamming Distance of the bits.
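The last-round Hamming Distance attack described above can be illustrated on simulated data. The following is a minimal sketch, not the MATLAB processing used in this research: the traces are synthetic (one leakage sample per encryption, modeled as HD plus Gaussian noise), ShiftRows is ignored so that the register byte and the ciphertext byte share a position, and all function names are hypothetical. The SBOX is computed from its definition in [6] rather than typed in.

```python
import random


def gmul(a, b):
    """Multiply two bytes in GF(2^8) with the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """SBOX = affine transform of the multiplicative inverse in GF(2^8)."""
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else next(b for b in range(1, 256) if gmul(a, b) == 1)
        s = inv
        for n in range(1, 5):
            s ^= rotl8(inv, n)
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x


def hw(x):
    return bin(x).count("1")


def hd_last_round(ct_byte, key_guess):
    """HD between the state byte before the last SubBytes and the ciphertext byte."""
    before = INV_SBOX[ct_byte ^ key_guess]
    return hw(before ^ ct_byte)


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


rng = random.Random(1)          # fixed seed so the sketch is repeatable
TRUE_KEY_BYTE = 0x13            # first byte of the last round key in Equation 1
cts = [rng.randrange(256) for _ in range(1000)]
traces = [hd_last_round(c, TRUE_KEY_BYTE) + rng.gauss(0, 1.0) for c in cts]

scores = [pearson([hd_last_round(c, k) for c in cts], traces) for k in range(256)]
best_guess = max(range(256), key=scores.__getitem__)
print(hex(best_guess), round(scores[best_guess], 3))
```

Because the simulated noise is far lower than in a real EM measurement, the correct key byte stands out after only a thousand traces, whereas the measured baseline above required at least 500,000.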
Trace Alignment Countermeasures
In order to perform an effective DPA attack, trace alignment must be accomplished for correlation between the power traces and the power model. Without alignment, each power trace/expected intermediate value pair would have to be evaluated individually for each trace.
Existing Approaches
A relatively popular method for accomplishing misalignment is to randomly insert delay cells into the algorithm. In general, inserting delay cells effectively misaligns the traces and hinders the effectiveness of a DPA attack [9, 10]. Figure 5 shows how misalignment can effectively minimize correlation between the traces and the data. However, as stated in [11], there are methods to pinpoint and eliminate the delay cells within the power trace, therefore eliminating the misalignment attempts. In order to prevent this re-alignment, Zafar et al. introduced in [11] a clock variance method to accomplish misalignment with the same results as delay insertion, but without the risk of the delays being manually removed during DPA processing. This idea is further improved in [3] by randomizing each clock cycle instead of changing the clock frequency after a specified period of time. By using this approach, trace alignment is virtually impossible to accomplish using traditional methods. The downside of randomizing every clock cycle is an increase in the delay of the system. If the frequency of the clock is varied from 100% to 50% (from 1x to 2x slower), the average frequency will be 75% of the original (1.5x slower on average). This degradation is significant when compared to a handful of manually inserted single clock-cycle delay cells added to the overall system runtime.
Randomized Clock Implementation
In order to randomize the clock in this research, a 16-bit linear feedback shift register (LFSR) is polled at the end of each clock cycle to pseudo-randomly choose between four clock cycle lengths. The LFSR is seeded with a pseudo-random value at the beginning of each encryption process to ensure variance between algorithm runs. Optimally, the clock should vary between 100% and 50% of the original frequency.
However, for this research, the clock varies between 33% and 16.7%. Initially, the idea was to simply increase the clock frequency on the Virtex-5 bus by 300%, but the clock was already set near its maximum frequency (100 MHz with a limit of 125 MHz). An external clock could be added to the system, but that is out of the current scope of the project. The randomized clock design unique to this research is given in Figure 6. Note that the clock in Figure 6 represents the clock period. Therefore, the clock generator effectively slows the clock rate by the indicated amounts. An example power trace collected with the randomized clock is given in Figure 7. It is important to notice the variation when compared to Figure 3. In fact, the power signature varies greatly for each iteration when there is a randomized clock. In order to determine effectiveness in obfuscation of the circuit, the same DPA attack is performed on the circuit with a randomized clock as performed on the original circuit. The key guess results for byte 1 are shown in Figure 8. As expected, there is no apparent correlation for the correct key guess. The main difficulty in performing DPA is aligning the traces of the AES algorithm using a randomized clock frequency. Riscure Inspector software includes several advanced alignment algorithms; however, they did not provide correct alignment against the randomized clock countermeasure.
By randomizing the clock, the circuit is effectively protected from common DPA attacks by preventing alignment of the traces. The total area of the circuit changes only minimally to add the random clock generator. However, the runtime of the circuit averages to 4.5x longer than the original, or an average 25% of the original frequency. Optimally, if an external clock was added to the system, the runtime of the circuit could be minimized to 1.5x longer than the original, or an average 75% of the original frequency.
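The randomized clock selection and its runtime cost can be modeled in software. This is a behavioral sketch only, not the hardware design of Figure 6. It assumes the four clock cycle lengths are 3, 4, 5, and 6 times the original period (inferred from the 33% to 16.7% frequency range and the 4.5x average), that the two low bits of the LFSR select the length, and that the LFSR uses the common maximal-length taps at bits 16, 14, 13, and 11. None of these details are specified in the text.

```python
PERIOD_MULTIPLIERS = [3, 4, 5, 6]  # assumed cycle lengths, in original clock periods


def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR (taps at bits 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def clock_periods(seed, n_cycles):
    """Poll the LFSR once per clock cycle and pick the next cycle length."""
    state = (seed & 0xFFFF) or 1  # an all-zero state would lock up the LFSR
    periods = []
    for _ in range(n_cycles):
        state = lfsr16_step(state)
        periods.append(PERIOD_MULTIPLIERS[state & 0b11])
    return periods


periods = clock_periods(seed=0xACE1, n_cycles=65535)
mean_period = sum(periods) / len(periods)
print("mean cycle length:", round(mean_period, 3), "x original")
print("effective frequency:", round(100 / mean_period, 1), "% of original")
```

Under these assumptions the runtime scales with the mean cycle length, which is where the 4.5x figure comes from. The effective frequency computed as the reciprocal of the mean period is slightly lower than the midpoint of the 33% to 16.7% range, since the two averages are different quantities.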
Dual-Rail and Bit-Balancing Countermeasures
One of the more popular hiding countermeasures is the attempt to flatten the power signature of all components directly within the circuit's hardware for all values of data. This approach commonly utilizes dual-rail or bit-balancing logic as countermeasures.
Existing Approaches
An effective method is performed at the cell level using dual-rail precharge (DRP) logic blocks [2]. The idea behind DRP logic is to create logic cells that make power consumption constant during each clock cycle. Every input and output of a cell is paired with its inverse, and therefore a constant balance of '0's and '1's enters and exits the cell at all times. Figure 9 gives an example of a dual-rail AND gate. Several dual-rail designs have been proposed to protect cryptographic algorithms, and in general they are quite effective [12, 13, 14]. However, one major problem with dual-rail systems is the increase in circuit area and the associated decrease in speed. Designs such as [12] increase circuit area by 4.5x and nearly double the runtime. While dual-rail logic is effective in reducing the success of HW and HD attacks, its complexity does not come without a price.
Similar to dual-rail logic, system-level bit-balancing attempts to balance the Hamming Weight for every intermediate value. Attempted by [4], system-level bit-balancing runs two concurrent cryptographic algorithms. The first algorithm performs as expected and delivers the correct output data. The second algorithm processes the inverse of the data and produces inverse output data. When evaluated as a whole, the Hamming Weight of the AES system remains constant during the entire encryption process. The challenge is combining the two circuits in such a way that an attacker cannot differentiate the power emanating from the two separate algorithms.
System Level Bit-Balancing Design
In addition to the randomized clock, this research develops a system-level bit-balancing design to further obfuscate the circuit against HW and HD attacks. The bit-balancing design is similar to [4], but differs in several key areas. First, the key remains unchanged as opposed to [4]. Second, the key schedule is left untouched and the round keys remain the same. Third, the input data is inverted before entering the AES algorithm, unlike the design in [4]. The design for the inverted circuit is given in Figure 10. To begin, the input data is inverted before entering the system. Due to their linearity, the AddRoundKey, ShiftRows, and MixColumns components of the AES algorithm retain the inversion of the data (inverted input = inverted output from the components). However, the main design challenge comes from the nonlinear SubBytes component. Finally, this design includes HD resistance, while [4] does not.
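The claim that AddRoundKey, ShiftRows, and MixColumns carry an inverted input through to an inverted output can be checked numerically. The sketch below is a software check under the stated design (unchanged round keys); the helper names and the flat 16-byte column-major state layout are assumptions made for illustration.

```python
import random


def xtime(a):
    """Multiply a byte by 2 in GF(2^8) with the AES polynomial 0x11B."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def mix_column(col):
    """MixColumns applied to one 4-byte column."""
    a0, a1, a2, a3 = col
    mul3 = lambda v: xtime(v) ^ v
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def shift_rows(state):
    """ShiftRows on a 16-byte state stored column by column."""
    return [state[(i + 4 * (i % 4)) % 16] for i in range(16)]


def add_round_key(state, key):
    return [s ^ k for s, k in zip(state, key)]


def invert(data):
    return [b ^ 0xFF for b in data]


def inversion_preserved(trials=1000, seed=0):
    """Check that each linear step maps inverted input to inverted output."""
    rng = random.Random(seed)
    for _ in range(trials):
        state = [rng.randrange(256) for _ in range(16)]
        key = [rng.randrange(256) for _ in range(16)]  # same key for both halves
        if add_round_key(invert(state), key) != invert(add_round_key(state, key)):
            return False
        if shift_rows(invert(state)) != invert(shift_rows(state)):
            return False
        if mix_column(invert(state[:4])) != invert(mix_column(state[:4])):
            return False
    return True


print(inversion_preserved())
```

MixColumns preserves the inversion because the coefficients in each row of its matrix (02, 03, 01, 01) sum to 01 in GF(2^8), so an all-ones column maps to itself.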
In order for the output data from the SBOX to be inverted when provided an inverted input, it is helpful to look at how the SBOX handles the data. For the specific AES algorithm used in this research, a lookup table is used for the SubBytes function. The SBOX lookup table is given in Table 1 [6] .
Each cell is indexed by the input value. For example, an input of 0x03 would output 0x7B. Because the output needs to be inverted, each value within the SBOX is replaced with its inverted value (0x7B would be replaced with 0x84) for the inverted AES algorithm. However, one more step must be taken to ensure an inverted output. Because the input data will be inverted, indexing must also be inverted. Therefore, what was in the top left cell must now be moved to the bottom right, etc. Once this index "rotation" is accomplished, the new SBOX is ready to output inverted data given an inverted input. The new SBOX for the inverted circuit is shown in Table 2 . Note that 0x84 (the inverse of 0x7B) is now indexed by 0xFC, the inverse of the original index 0x03.
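The construction of the inverted SBOX described above (invert every stored value, then invert the indexing) can be sketched as follows. To keep the example self-contained, the standard SBOX is computed from its definition in [6] rather than copied from Table 1; the names `build_sbox` and `INVERTED_SBOX` are illustrative.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) with the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """SBOX = affine transform of the multiplicative inverse in GF(2^8)."""
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else next(b for b in range(1, 256) if gmul(a, b) == 1)
        s = inv
        for n in range(1, 5):
            s ^= rotl8(inv, n)
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()

# Inverted SBOX: the entry at the inverted index holds the inverted output.
INVERTED_SBOX = [0] * 256
for x in range(256):
    INVERTED_SBOX[x ^ 0xFF] = SBOX[x] ^ 0xFF

print(hex(SBOX[0x03]), hex(INVERTED_SBOX[0xFC]))
```

With this table, a lookup on an inverted input byte always returns the inverse of what the standard SBOX returns for the original byte, so the two halves of the design stay complementary through SubBytes.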
Simply inverting all the intermediate values only protects against HW attacks and not HD attacks. This is a problem, especially since HD attacks are commonplace and relatively easy to perform. The system-level bit-balancing design for this research also differentiates itself by including features to resist HD attacks along with its HW resistance. In order to protect against HD attacks, gate-level precharging has been used in the past [2, 12, 13]. However, this research uses system-level bit-balancing, so system-level precharging is needed. To minimize the vulnerability from HD attacks, 10 buffer cycles are added between the rounds to clear the intermediate values within the rounds. This system-level precharging is accomplished by sending a 0 as the input data and key into the round. This buffering allows the HW bit-balancing to effectively protect against HD attacks as well.
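The reason inversion alone does not stop HD attacks, and why clearing the intermediate values does, can be seen by counting bit toggles for a single byte. This is a simplified sketch that assumes both the normal and the inverted halves are cleared to 0 during the buffer cycle; the function names are hypothetical.

```python
def hw(x):
    return bin(x).count("1")


def toggles_without_precharge(prev, cur):
    """Bit flips in the normal and inverted halves for a direct prev -> cur update."""
    return hw(prev ^ cur) + hw((prev ^ 0xFF) ^ (cur ^ 0xFF))


def toggles_with_precharge(cur):
    """Bit flips when both halves are loaded from a cleared (all-zero) state."""
    return hw(0 ^ cur) + hw(0 ^ (cur ^ 0xFF))


# Without clearing, the combined toggle count is twice the HD and depends on the data.
unbalanced = {toggles_without_precharge(p, c) for p in range(256) for c in range(256)}

# With clearing, every value produces the same combined toggle count.
balanced = {toggles_with_precharge(c) for c in range(256)}

print(sorted(unbalanced))
print(sorted(balanced))
```

After a clear, the Hamming Distance of each update equals the Hamming Weight of the new value, which the two complementary halves already balance to a constant per byte.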
Of course, the output of the rounds must be stored between clock cycles while the system is clearing the intermediate values. This register storage introduces a potential vulnerability. To mitigate the potential for an attacker to perform HD correlation on the registers storing the intermediate values between rounds, an extra feature is added to the design. An LFSR is connected to a multiplexer, and the outputs of the rounds are stored in one of four locations randomly chosen by the LFSR. In this way, there is only a 25% chance of the HD recorded by the EM probe matching the data. Therefore, the HD leakage is effectively minimized. Depending on security requirements the number of possible register locations could be increased, but at a cost of required circuit area and speed. A graphical view of the correlation between the traces and power models is given in Figure 11 . As expected, the system-level bit-balancing design is resistant to common DPA attacks with up to three million or more traces collected.
For the bit-balancing design, the 10 added clock cycles add minimal delay to the system, while the area and power consumed by the countermeasure are around triple that of the unprotected design. These costs are still preferable to those of comparable dual-rail logic designs. The system-level bit-balancing design is resistant to both HW and HD attacks and is implemented as one interspersed module on the chip, making differentiation of the two halves of the algorithm significantly more difficult.
Conclusion
The countermeasures developed and tested in this research are effective in obfuscating the key and intermediate data of the AES algorithm from standard DPA attacks. Those attacks include correlation with bit, Hamming Weight, Hamming Distance, and Zero-Value models used to determine the secret key. The randomized clock prevents an attacker from easily aligning traces, a step necessary to perform DPA. The second countermeasure, system-level bit-balancing, was designed with HW and HD attacks in mind, and is implemented on a single chip. Of course, there are added costs to the user in terms of circuit delay, area, and power consumed, but all costs are within reasonable levels compared to similar hiding countermeasures. The results are summarized in Table 3. The DPA resistance is given in number of traces needed to attack the circuit. However, it is important to note that the randomized clock and bit-balancing countermeasures were never successfully attacked, which indicates that more than three million traces would be needed to attack the circuits.
Because of the relative independence between the two countermeasures, combining the randomized clock and system-level bit-balancing design would only add to the side-channel attack resistance of the circuit, and should be considered for maximizing obfuscation. This research provides contributions for the protection of information processed using hardware AES. The results and analysis indicate successful demonstration of the goals of this research, mainly to obfuscate sensitive data against side-channel attacks. Not only was a design developed and presented, but implementation and real-world testing were performed using an FPGA system. It is hypothesized that results would be similar in customized application specific integrated circuit (ASIC) designs.
