This paper presents a new countermeasure against Side-Channel Analysis (SCA) attacks, whose implementation is based on a hardware-software co-design. The hardware architecture consists of a microprocessor, which executes the algorithm using a false key, and a coprocessor that performs the operations needed to retrieve the original text as encrypted with the real key. The coprocessor hardly affects the power consumption of the device, so that any classical attack based on such power consumption would reveal a false key. Additionally, as the operations carried out by the coprocessor are performed in parallel with the microprocessor, the execution time devoted to encrypting a specific text is not affected by the proposed countermeasure. In order to verify the correctness of our proposal, the system was implemented on a Virtex 5 FPGA. Different SCA attacks were performed on several functions of the AES algorithm. In all cases, the experimental results show that the system is effectively protected, since the attack reveals a false encryption key.
Introduction
Since Kocher et al. [1] demonstrated the vulnerabilities of cryptographic devices in the late 1990s, Side-Channel Analysis (SCA) attacks have become the most significant threat to the security of cryptographic algorithms.
These attacks base their success on analyzing the leakage information that is mainly observable through the power consumption or the electromagnetic (EM) radiation emitted by a hardware device. The attack is feasible because either of these two quantities is related to the data being processed by the device, which in turn depend on the value of the cryptographic key.
Once this weakness was revealed, part of the scientific community directed its efforts towards proposing countermeasures that provide resistance against SCA attacks. Although with some differences, almost all proposed solutions attempt to design systems in which the power consumption (or the EM radiation) is independent of the data that they process. This objective is achieved either by providing systems with random power consumption or by building devices in which such power is constant in each clock cycle. The latter approach, known as hiding, has usually been implemented at cell level based on the Dual-Rail Precharge (DRP) logic style. This style represents each signal with two complementary wires, in such a way that only one transition is produced in every clock cycle. Thus, during the pre-charge phase, both the direct and complementary wires are charged, whereas in the evaluation phase only one of them is discharged. Among the most significant proposals of this logic style are Sense Amplifier Based Logic (SABL) [2] and Wave Dynamic Differential Logic (WDDL) [3]. However, the main drawback of such DRP logic styles is that their success depends on the perfect balancing between the capacitive loads related to the complementary wires that form the overall circuit. This requirement implies including some constraints on the placement and routing steps.
In contrast, the former approach, known as masking, has been implemented at both algorithm and cell levels. At cell level, the most relevant proposals are Random Switching Logic (RSL) [4], Dual Random Switching Logic (DRSL) [5] and Masked Dual-Rail Precharge Logic (MDPL) [6]. Moreover, it has been shown that in general the security of cell level implementations could be compromised due to the effect of the inter-wire capacitances [7] or the so-called early propagation effect [8] [9]. As these vulnerabilities became known, the previous proposals were updated, including new measures that make systems more secure against most of these harmful effects. For instance, the original MDPL, which is inherently a glitch-free logic style based on majority gates, was modified to cope with the early propagation effect (iMDPL) [10]. Other examples of improved DRP styles can be found in [11] and [12].
More recently, a new countermeasure termed SecLib has been proposed [13] .
The early evaluation is prevented by designing specific cells based on two stages that avoid such effect. However, as stated by the authors, it also increases the cost in terms of area, delay and power consumption.
Masking is a countermeasure that can also be implemented at algorithm level. In [14], the authors proposed an implementation of the AES encryption algorithm using six independent masks. The algorithm was executed on an 8-bit microcontroller, leading to an execution time twice that of the unmasked version. There are also some proposals for implementing hiding countermeasures in software. These approaches aim at introducing temporal jitter in the sequence of operations performed by the microprocessor. This way, the instant at which an effective attack might be produced is distributed over time following an unknown probability distribution function (misalignment of power traces). Some examples of these software countermeasures consist of introducing dummy cycles [15] or a random variation in the execution order [16].
Other publications aim at introducing noise to reduce the correlation between the processed data and the cryptographic key. Following this idea, an interesting approach was proposed in [17], in which a noise generator correlated with the data that is being processed is included. However, the attack is only effective when the target is the function correlated with the introduced power noise. Additionally, the revealed key is not always the same and it depends on the number of traces captured and used to perform the attack.
As mentioned above, SCA attacks base their success on exploiting the existing dependence between the processed data and the power consumption (or EM radiation). As data depend on the cryptographic key, from a statistical point of view it means that there exists a correlation between such a key and the consumed power. Theoretically, only the correct key is able to produce a correlation with a significant value, whereas the rest of the keys would generate a value close to zero. Countermeasures based on hiding or masking try to eliminate this correlation, in such a way that any SCA attack, performed on any possible key, does not produce any relevant result that could be distinguished among all others.
In other words, all correlations between power consumption and guessed keys are equally likely and tend to zero.
The countermeasure proposed in this paper is completely different from previous approaches. The mechanism for protecting the system consists in revealing a false key when an SCA attack is performed. This false key (or fake key) produces the highest correlation coefficient between the data processed and the power consumed by the hardware device. Thus, from the perspective of an attacker, the system behaves as an unprotected implementation.
Although the proposed countermeasure, termed faking, could be entirely implemented in software, the penalty on the execution time would be quite significant. In fact, including all additional calculations needed to conceal the real key, such execution time is almost doubled when compared with the non-protected version. Instead, the implementation presented in this paper is based on a hardware/software co-design. The system consists of a microprocessor, which executes in software the classical Advanced Encryption Standard (AES) 128-bit cryptographic algorithm, and a coprocessor specifically designed for implementing the proposed countermeasure. The proposed architecture is intended for applications in which the main task performed by the microprocessor is a specific processing from which critical information is obtained. The encryption is necessary for storing this confidential data in an external device or for sending such information through a non-secure channel. For instance, the microprocessor could be used for analyzing a fingerprint image from which a confidential biometric feature is obtained and should be stored in an external memory. Although it is out of the scope of this paper, in applications where the encryption is the main task to be performed, a complete hardware implementation would be more suitable and faster.
Regardless of the implementation chosen, hardware, hardware-software or pure software, the level of security for all of them is identical and only their features in terms of area and speed are different.
This paper is organized into five sections. Section II presents the fundamentals of the proposed countermeasure. The aim of Section III is to describe the internal architecture of the coprocessor and its main features. Section IV presents the experimental results. Finally, Section V presents the conclusions.
Introduction
The structure of the AES 128-bit encryption algorithm is represented in [19] .
Although in the proposal presented by Kocher the cryptographic key was found using the differential-of-means method, currently the most widespread statistical method employed for this purpose is based on correlation [14]. This method consists of the following steps:
a) The encryption algorithm is executed M times using a set of M different plain texts. For each one, a current trace is captured and stored for its subsequent processing.
b) It is quite usual to choose as the point to be attacked (target) the output of one of the four operations involved in the AES algorithm (which is the input of the following one), since their result (state) is normally written to a memory or register, which creates a distinguishable point in the captured power trace.
Note that the choice of any of the four operations as the target of the attack facilitates the calculation of the value related to the theoretical model of
f) The highest correlation corresponds to the true encryption key.
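The correlation-based procedure outlined in the steps above can be illustrated with a small simulation. The sketch below is not the measurement setup of this paper: it assumes a single-sample "trace" equal to the Hamming weight of the SubBytes output plus Gaussian noise, attacks one key byte only, and uses hypothetical names (`simulate_traces`, `cpa_attack`) and parameters (300 traces, noise deviation 0.5) chosen purely for illustration.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return r

def build_sbox():
    """Build the AES S-box from the field inverse and the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):         # XOR of four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def hw(x):
    """Hamming weight of a byte."""
    return bin(x).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def simulate_traces(key_byte, plaintexts, noise, rng):
    """Assumed leakage model: HW of the SubBytes output plus Gaussian noise."""
    return [hw(SBOX[p ^ key_byte]) + rng.gauss(0, noise) for p in plaintexts]

def cpa_attack(plaintexts, traces):
    """Correlate the HW model of every key guess with the traces."""
    scores = [pearson([hw(SBOX[p ^ k]) for p in plaintexts], traces)
              for k in range(256)]
    best = max(range(256), key=lambda k: scores[k])
    return best, scores

rng = random.Random(1)
TRUE_KEY = 0x3C                                   # hypothetical key byte
plaintexts = [rng.randrange(256) for _ in range(300)]
traces = simulate_traces(TRUE_KEY, plaintexts, 0.5, rng)
recovered, scores = cpa_attack(plaintexts, traces)
print(hex(recovered))
```

On a device protected by faking, the same procedure would be expected to return the corresponding byte of the fake key, since that is the key the microprocessor actually processes.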
Basis of the faking countermeasure
The underlying idea behind the proposed faking countermeasure is to carry out the encryption process using a fake key $Key_{FAKE}$, which is obtained by XORing the real one, $Key_{REAL}$, with a mask $Key_{MASK}$ in the following way:

$$Key_{FAKE} = Key_{REAL} \oplus Key_{MASK} \qquad (2)$$
Note that, in the general case, each of the keys included in (2) consists of 16 different bytes, and they can be represented by a matrix of 4x4 bytes. The basic structure of the protected system is presented in Fig. 2. The operations 
Thus, a simple way of recovering the original byte $a_R(i, j)$ encrypted with $Key_{REAL}$ at the output of SubBytes would be to define SBoxTrans as follows
Using (2) and the algebraic properties of the XOR operator, (4) becomes:
so that computing (4) and (5) with an exclusive-OR, $SBox(a_R(i, j))$ is directly obtained. In fact, such an operation, termed remasking, is performed at the end of each round by combining the outputs of MixColumns and MixCol (see Fig. 2).
It is noteworthy that SBoxTrans can be implemented as a simple lookup table and that, in the worst case, its implementation would be possible using a memory of 4 Kbytes. In practice, it is advisable to choose the maximum value for I, in order to obtain the same strength as the original AES algorithm (experimental results were obtained using a $Key_{MASK}$ which consists of 16 different bytes).
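The lookup-table remark and the 4 Kbyte figure can be made concrete with a short sketch. It rests on an assumption inferred from the surrounding text rather than on a formula quoted from the paper: SBoxTrans is taken to be $SBox(a_F) \oplus SBox(a_F \oplus Key_{MASK})$, so that XORing it with the SubBytes output yields $SBox(a_R)$. The mask bytes and the names `build_sbox_trans` and `recover_real_sbox_output` are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return r

def build_sbox():
    """Build the AES S-box from the field inverse and the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):         # XOR of four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def build_sbox_trans(mask_byte):
    """Assumed SBoxTrans table for one mask byte: 256 one-byte entries."""
    return [SBOX[a_f] ^ SBOX[a_f ^ mask_byte] for a_f in range(256)]

# Hypothetical mask made of 16 different bytes (the worst case for memory).
KEY_MASK = [(17 * i + 0x5A) & 0xFF for i in range(16)]
TABLES = [build_sbox_trans(m) for m in KEY_MASK]
TOTAL_BYTES = sum(len(t) for t in TABLES)         # 16 tables of 256 bytes

def recover_real_sbox_output(a_f, position):
    """XOR of the SubBytes output and the SBoxTrans output for one byte."""
    return SBOX[a_f] ^ TABLES[position][a_f]

print(TOTAL_BYTES)
```

Under this assumption, one 256-byte table is needed per distinct mask byte, which is consistent with the 4 Kbyte worst case when all 16 mask bytes differ.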
Weakness of faking and use of masking techniques

When implementing the coprocessor, it is important to take into account some weaknesses that are usually related to hardware implementations.
Devices protected by Boolean masking approaches, which leak the Hamming distance, may not be completely secure if internal operations are not performed carefully. Indeed, taking into account (1) 
In devices that leak the Hamming weight distance, a vulnerability is pro- 
A similar idea is behind the so-called second-order attacks [20].
• Since $Key_{FAKE}$ is known, if the attacker is able to determine $Key_{MASK}$, then $Key_{REAL}$ would be obtained by simply applying (2). Note that the output of SBoxTrans in (4) 
Equation (8) is the form of SBoxTrans that is actually implemented, and it is the one used to obtain the experimental results. Such a mask is obtained by including a True Random Number Generator (TRNG), which updates its value for each encrypted plain text.
• It is observed that the output of SubBytes and the output of SBoxTrans can be combined with an exclusive-OR, leading to a new value $S_{COMB}$.
If such a value were not protected by the mask $M_h(i, j)$, then the system would be vulnerable to a second-order attack, since $S_{COMB}$ reveals the bytes $a_R(i, j)$ encrypted with $Key_{REAL}$. Note that, although second-order attacks are usually performed on masked systems, here we use the same terminology since the attack on $S_{COMB}$ exploits the leakage of two intermediate values (i.e. $SBox(a_F(i, j))$ and $SBoxTrans(a_F(i, j))$).
• The function MixCol, included in the coprocessor, operates with the exclusive-OR on several bytes of different rows obtained at the output of SBoxTrans.
Note that these bytes are concealed with the mask $M_h(i, j)$. As described in ( formed by such masks is created. Note that this new matrix can also be pre-computed before executing the encryption algorithm. Finally, it is observed that, as the expanded key and $M_k(i, j)$ are combined with an exclusive-OR before AddRoundKey is activated, any SCA attack on such rounds would also reveal the fake key $Key_{FAKE}$.
• Additionally, as the coprocessor uses a different set of masks at the input and output of MixCol, the possibility of performing a second-order attack between both intermediate values is eliminated.
As will be described in the next section, these and other protection measures are taken into account when designing the internal architecture of the coprocessor.
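The remark that a matrix formed by the masks can be pre-computed relies on a general property of the AES MixColumns transformation: it is linear with respect to the exclusive-OR. The following sketch checks that property on one column. It illustrates the standard AES operation only, not the internal design of the MixCol block, and the names `mix_column` and `xor_columns` are hypothetical.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return r

def mix_column(col):
    """Standard AES MixColumns applied to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
        a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
        a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
        gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3),
    ]

def xor_columns(x, y):
    """Byte-wise XOR of two columns."""
    return [a ^ b for a, b in zip(x, y)]

rng = random.Random(0)
data = [rng.randrange(256) for _ in range(4)]     # unmasked column
mask = [rng.randrange(256) for _ in range(4)]     # input masks of the column

# The output mask depends only on the input masks, so it can be
# computed before the data are known.
output_mask = mix_column(mask)
masked_result = mix_column(xor_columns(data, mask))
unmasked = xor_columns(masked_result, output_mask)
print(unmasked == mix_column(data))
```

Because the output mask depends only on the input masks, it can be prepared before the encryption starts, and the masked data never need to be unmasked while they are being processed.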
Architecture of the coprocessor
Some aspects related to the design of the coprocessor depend on how the microprocessor manages the execution of instructions and the access to data stored in memory. Fig. 3 shows a simulation that represents the processing of a byte when the microprocessor computes the SubBytes function over the particular value $v_m$ = 0x05. The microprocessor employed for this purpose is the MicroBlaze, which is also used for obtaining the experimental results. This microprocessor consists of five pipeline stages, so that the execution of an instruction is performed five cycles after its fetch. The first two rows of described as unsafe area in Fig. 3, the coprocessor must be disabled in order to facilitate an SCA attack that produces the highest correlation coefficient. These conclusions are experimentally corroborated by Fig. 4, which shows the calculation of the correlation coefficient during all instants of time included in the unsafe area and using the power models based on HD and HW, respectively.
These figures were obtained by applying to each clock cycle the attack process described in Section 2.1, and representing only the value obtained for the highest correlation coefficient related to the false key. The HD model is quite effective in CLK 9, and it could be easily calculated since the previous ($v_m$ = 0x05, output of ShiftRows) and the subsequent ($u_m$ = 0x6B, output of SubBytes) values written on lmb writebus are known. In contrast, note that when a byte is written on such a bus, its 24 most significant bits are pre-charged to 0 during three clock cycles. This behavior makes the HW the most effective model for the rest of the unsafe area. As can be seen in Fig. 4 (lower trace), in accordance with the simulation results, in cycles CLK 10 and CLK 13 the value of the correlation coefficient is quite significant.
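As a small worked example of the two power models mentioned above, the sketch below evaluates them for the values quoted in the text ($v_m$ = 0x05 and $u_m$ = 0x6B). The function names are hypothetical, and the link drawn between a bus that starts from zero and the HW model is the usual textbook argument, given here as an illustration rather than as a description of the measured bus.

```python
def hamming_weight(x):
    """Number of bits set in x (HW model)."""
    return bin(x).count("1")

def hamming_distance(previous, current):
    """Number of bits that toggle between two consecutive values (HD model)."""
    return hamming_weight(previous ^ current)

V_M = 0x05   # output of ShiftRows quoted in the text
U_M = 0x6B   # output of SubBytes quoted in the text

hd_transition = hamming_distance(V_M, U_M)     # 0x05 -> 0x6B on the bus
hw_value = hamming_weight(U_M)                 # weight of 0x6B alone
hd_from_zero = hamming_distance(0x00, U_M)     # lines start from all zeros

print(hd_transition, hw_value, hd_from_zero)
```

When the lines start from zero, the Hamming distance to the written value equals its Hamming weight, which is why the HW model fits transitions that begin from a cleared state.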
The coprocessor knows the area of memory in which the SBox table is stored, so that it could be disabled by simply monitoring the address bus lmb abus.
In Fig. 3, this situation is represented by the signal Coprocessor halt, which is activated when such memory space is addressed.
• If during the encryption of plain text $T_i$ the output of SBoxTrans is concealed with masks $M_h(i, j)$ (included in MEM0), then in the following plain text $T_{i+1}$ such output will be concealed with masks $M_h(i, j)$ (included in MEM1) and vice versa.
• The True Random Number Generator (TRNG), included as part of the coprocessor, creates a new set of masks $M_h(i, j)_{CREATED}$ and $M_h(i, j)_{CREATED}$ for each encrypted plain text $T_i$ and $T_{i+1}$, respectively.
• The created mask $M_h(i, j)_{CREATED}$ and the current contents of its corresponding group of memories, concealed with $M_h(i, j)_{OLD}$, are combined with an exclusive-OR. Thus, the elements of such memories become concealed with $M_h(i, j)_{OLD} \oplus M_h(i, j)_{CREATED}$. The second group of memories, related to $M_h(i, j)$, is updated in a similar way.
• The operations of creating and updating the masks are performed during the execution of rounds 5 and 6. Note that, when such rounds are processed, the state depends on the 128 bits of the cryptographic key, so that performing an SCA attack is considered impractical. This way, the interfering noise created when updating these masks does not affect any potential attack performed on rounds in which the system is vulnerable and can reveal the false key (first and last rounds).
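The mask update described in the list above can be sketched in a few lines. This is a simplified model under stated assumptions: a single memory of 256 one-byte entries, a single mask byte shared by all entries, placeholder table contents, and Python's `secrets` module standing in for the TRNG. The name `refresh_masked_table` is hypothetical.

```python
import secrets

def refresh_masked_table(stored, mask_created):
    """XOR every stored (already masked) entry with the newly created mask."""
    return [value ^ mask_created for value in stored]

# Placeholder contents; in the coprocessor these would be SBoxTrans entries.
table = [(7 * i + 3) & 0xFF for i in range(256)]

mask_old = secrets.randbelow(256)                  # mask currently in use
stored = [value ^ mask_old for value in table]     # memory never holds raw data

mask_created = secrets.randbelow(256)              # fresh value from the "TRNG"
stored = refresh_masked_table(stored, mask_created)

# After the update, the memory is concealed with the XOR of both masks.
mask_new = mask_old ^ mask_created
print([value ^ mask_new for value in stored] == table)
```

The update never exposes the unmasked contents: each entry goes directly from being concealed with the old mask to being concealed with the combined one.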
On the other hand, the signal and such element is stored in a second set of circular shift registers. Once this process is finished, the following column is pre-charged and the process for computing the rest of the columns of SBoxTrans is initiated again. The coprocessor MixCol performs all these computations in 24 · $T_{CLK}$.
Experimental and simulation results
Area and correlation results
In order to prove the correctness of our proposal, the complete system was implemented on a Virtex-5 FPGA clocked at 24 MHz. Power traces were measured using a Tektronix CT-1 current probe featuring a bandwidth range of 25 kHz to 1 GHz. The current probe was connected to an Agilent DSO1024A oscilloscope, which captures and stores current traces using a sample rate of 2 GS/s. The implemented system includes a MicroBlaze microprocessor, specific hardware (the coprocessor) that implements both the SBoxTrans and MixCol blocks, and finally a set of peripherals used for debugging the application and provid- the overall system, and the maximum frequency given by the critical path, are represented in Table 1.
Basically, the MicroBlaze executes the AES 128-bit algorithm by encrypting the plain text with $Key_{FAKE}$. Table 2 shows the execution time of each of the four operations for a specific round. As can be seen, the SubBytes operation is
Fig. 7 compares the total input current consumed by the device when the coprocessor is activated (Fig. 7a) or disabled (Fig. 7b). The additional power consumption caused by its activation is concentrated in the next 51 · $T_{CLK}$ (2.12 µs), which includes both the processing and the communication delays.
The difference between the power consumption in both situations, activated or disabled, is perfectly distinguishable and measurable. Such difference, shown in Fig. 7c, is higher than 5 mA (absolute value). Moreover, the energy consumed by the coprocessor is added to the global power, so that such energy behaves as noise that deteriorates the correlation coefficient related to $Key_{FAKE}$. In order to conceal such information, the coprocessor is slowed down by introducing wait-state cycles, in such a way that its activity is distributed along the total time needed by the microprocessor for executing the individual operation SubBytes. This proposal is shown in Fig. 8, in which it can be observed how the coprocessor distributes the calculation of functions SBoxTrans and MixCol during the 638 · $T_{CLK}$ (26.58 µs) employed for executing SubBytes. Thus, a total of 587 wait-state cycles have been introduced. Furthermore, the difference between the input current traces when the coprocessor is activated or disabled (Fig. 8c) is nearly unnoticeable (0.5 mA), so that the correlation coefficient will be unaffected by the addition of the faking countermeasure. In fact, the value represented in Fig. 8c for such difference is almost constant whatever the state of the coprocessor is. Additionally, taking into account that the actual values processed by the coprocessor are concealed with $M_h(i, j)$, an SCA attack on SBoxTrans or MixCol would be unsuccessful even in the hypothetical situation in which its power consumption could be isolated from the rest of the system.
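The cycle counts and times quoted above are mutually consistent, as the following short calculation shows. It uses only the figures given in the text (a 24 MHz clock, 51 cycles of concentrated coprocessor activity and 638 cycles for SubBytes); the variable and function names are hypothetical.

```python
F_CLK_HZ = 24_000_000          # clock frequency stated for the Virtex-5 system

def cycles_to_us(cycles):
    """Convert a number of clock cycles into microseconds."""
    return cycles / F_CLK_HZ * 1e6

BURST_CYCLES = 51              # coprocessor activity without wait states
SUBBYTES_CYCLES = 638          # duration of SubBytes on the microprocessor

# Wait states needed to spread the coprocessor activity over SubBytes.
wait_states = SUBBYTES_CYCLES - BURST_CYCLES

print(cycles_to_us(BURST_CYCLES))      # about 2.12 microseconds
print(cycles_to_us(SUBBYTES_CYCLES))   # about 26.58 microseconds
print(wait_states)
```

Spreading the same amount of switching activity over roughly twelve times as many cycles is what lowers the per-cycle contribution of the coprocessor to the measured current.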
The experiments shown in Fig. 13 (non-protected) and Fig. 14 (protected) represent the evolution of the maximum correlation coefficient over an increasing number of plain texts for an attack performed on function SubBytes. These results are almost identical when the coprocessor is activated or disabled, so that the attacker is unable to find out whether the system is protected by the faking countermeasure or not. As can be seen, capturing about 25 power traces is enough to reveal the false key. Similarly, Fig. 15 shows the same attack performed on the MixColumns function, which leads to identical results.
Finally, Fig. 16 shows an attack based on the differential-of-means method proposed by Kocher. Unlike the original attack, which was performed on a single bit, our proposal targets a complete byte, following a similar strategy that introduces some modifications:
• The process is applied on each bit j included in the byte to be analyzed.
• For the specific bit j, in which the attack is initially focused, the N current traces are separated into two groups, depending on the value that such a bit takes on the power consumption model for a particular plain text and a specific key K n (n=0..255).
• For each key K n , the average of each group is calculated and the difference between each average is assigned to the element d(j,n) (j=0..7, n=0..255) of a matrix D.
• The process is repeated for all bits and keys until matrix D is completed.
• For each column n of matrix D, its average value D n (n=0..255) is calculated. The maximum value of D n indicates the correct key.
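The byte-wise differential-of-means procedure listed above can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: traces are reduced to a single simulated sample (Hamming weight of the SubBytes output plus Gaussian noise), the power consumption model is the SubBytes output byte, the difference is taken as the mean of the group where the bit is 1 minus the mean of the group where it is 0, and the names and parameters (`dom_attack`, 1000 traces) are hypothetical.

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return r

def build_sbox():
    """Build the AES S-box from the field inverse and the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):         # XOR of four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def dom_attack(plaintexts, traces):
    """Return the best key guess and the list of D_n values (n = 0..255)."""
    d_avg = []
    for n in range(256):                          # key guess K_n
        model = [SBOX[p ^ n] for p in plaintexts]
        column = []                               # d(j, n) for j = 0..7
        for j in range(8):
            sum1 = cnt1 = sum0 = cnt0 = 0
            for m, t in zip(model, traces):
                if (m >> j) & 1:
                    sum1 += t
                    cnt1 += 1
                else:
                    sum0 += t
                    cnt0 += 1
            if cnt1 == 0 or cnt0 == 0:
                column.append(0.0)                # degenerate split
            else:
                column.append(sum1 / cnt1 - sum0 / cnt0)
        d_avg.append(sum(column) / 8)             # D_n: average of column n
    best = max(range(256), key=lambda n: d_avg[n])
    return best, d_avg

rng = random.Random(7)
TRUE_KEY = 0xA7                                   # hypothetical key byte
plaintexts = [rng.randrange(256) for _ in range(1000)]
traces = [bin(SBOX[p ^ TRUE_KEY]).count("1") + rng.gauss(0, 0.5)
          for p in plaintexts]
recovered, d_avg = dom_attack(plaintexts, traces)
print(hex(recovered))
```

Averaging the eight per-bit differences uses the whole byte instead of a single bit, which is the modification with respect to the original single-bit attack described in the text.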
In order to compare the hardware-software implementation presented in this paper with a pure software execution, several plain texts have been encrypted by executing the faking countermeasure using only the microprocessor (the coprocessor is disabled). The results are shown in Table 3. In this case, the execution time for one round is 4847 · $T_{CLK}$ (about 202 µs), which is almost twice the time needed by our proposal shown in Table 2. Note that the execution time of the same function performed by the microprocessor in both experiments shows some slight differences depending on whether or not the coprocessor is activated. For instance, in the complete software implementation the function MixColumns is used both for processing the state and for calculating the block MixCol. Thus, in order to differentiate both situations, additional processing should be included, which produces such a difference in the execution time. Finally, the third column of Table 3 presents the results for one round when the faking countermeasure is disabled and the plain text is encrypted by the microprocessor (reference system). As can be seen, the execution time is almost identical to that of the hardware/software implementation, the main difference being due to the communication delays created when writing to or reading from the FSL bus (communication between the MicroBlaze and the coprocessor).
Comparison with other proposals
A fair comparison of our proposal against previous publications should be carefully performed. The results, in terms of area and speed, depend strongly on the FPGA family used for implementing the system. Thus, the coprocessor was on a hardware-software co-design. Thus, as the coprocessor only performs the countermeasure to protect the real key, its performance is usually higher. Table 4 shows such a comparison for different protected implementations of the AES algorithm performed on several FPGAs. As the coprocessor is masked, only designs protected by masking were included.
The implementation proposed by Regazzoni et al. [21] is performed on a Virtex 5. The authors presented two different structures based on a datapath of 32 bits and 128 bits, respectively. As the MixCol block included in the coprocessor is based on a 32-bit datapath, results are compared according to this design.
As can be seen, our implementation is carried out in 253 slices, so that only 40% of the logic resources used in that publication are needed for implementing the coprocessor. However, the maximum throughput provided by the implementation of Regazzoni is higher, leading to a better result. Note that our coprocessor was designed aiming at minimizing its power consumption per clock cycle. Such a strategy provides the highest correlation coefficient related to the false key, but at the expense of producing a lower throughput.
The protected system proposed by Kaumon et al. [17] is implemented in a
The paper presented by Sasdrich et al. [23] shows an alternative implementation of BMS, but using distributed RAM memory based on Slice-M LUTs available in modern FPGAs. They reduce by half the resources needed by the original proposal of Güneysu and, additionally, the throughput is increased by a factor of two. Our coprocessor, designed in a Spartan 6, could be implemented using fewer LUTs (518 against 1284) but more Flip-Flops (690 against 415) than this improved proposal. Results for the maximum frequency and throughput are almost identical.
Conclusions
A new countermeasure against SCA attacks and its implementation based on a software/hardware co-design were presented. The effectiveness of such a countermeasure relies on revealing a false key, rather than on eliminating the statistical dependence between data and power consumption, as is usually done by classical approaches. Functions implemented in hardware by the coprocessor hardly affect the power consumption, so that their effect on revealing the false key by means of a correlation attack is unnoticeable. Moreover, as the countermeasure is executed in parallel with the execution of the original algorithm by the microprocessor, there is no penalty on the execution time. The complete system was implemented on a Virtex 5 FPGA in order to obtain experimental results that corroborate the efficiency of our proposal.
