This paper proposes, for the first time, a multi-core architecture with asynchronous clocks to prevent power analysis attacks. The cores normally execute different tasks with their default clocks, but execute the cryptographic algorithm together with asynchronous clocks to foil side-channel attacks. The cryptographic algorithm is split into multiple parts, each of which is executed by one core, with all parts running simultaneously. Security analysis and simulation results show that differential power analysis (DPA) and correlation power analysis (CPA) attacks fail on the data encryption standard (DES) and the advanced encryption standard (AES) with the proposed architecture.
Introduction
Embedded systems have faced several security challenges in recent years. Power analysis attacks, an advanced form of side-channel attacks (SCAs) published by Kocher et al. in [1], use the power consumption during encryption or decryption to reveal the secret key of an embedded system. Differential power analysis (DPA), the most effective type of power analysis attack, requires a large number of power traces to perform the statistical analysis that predicts the data used in the computations. The attack rests on the fact that there is a significant difference in power consumption when processing "1" and "0". Correlation power analysis (CPA) [2], an effective variant of DPA, is based on the correlation coefficient between the instantaneous power consumption and the data processed in the device. The key to the success of DPA and CPA is that all power traces are correctly aligned, that is, the power consumption values at a fixed time must be caused by the same operation. Various countermeasures against power analysis attacks have been proposed over the past decade and can be categorized into several groups: masking, hiding and introducing noise. The principle of the masking methods [3] is to break the relationship between the power consumption and the intermediate values processed by the cryptographic device. However, some masking methods are only effective in Convenient Model [4], and they require specially designed cells and secure logic, which increase the cost in terms of area and power consumption. The basic idea of the hiding methods, such as wave dynamic differential logic (WDDL) [5], is to make the power consumption of different transitions as similar as possible. The hiding methods need balancing circuits, which enlarge the area and power consumption. Noise-introducing methods add extra noise to reduce the correlation between the processed data and the power consumption [6], but they also increase the area and power consumption.
In this paper, we propose a multi-core architecture with asynchronous clocks to prevent power analysis attacks. The original cryptographic algorithm must be modified into a parallel cryptographic algorithm, that is, split into multiple portions that are executed simultaneously by different cores. The proposed architecture is theoretically applicable to all cryptographic algorithms. The clocks of the cores are asynchronous while the parallel cryptographic algorithm is executed. To the best of our knowledge, this is the first time that a multi-core architecture with asynchronous clocks has been used to prevent power analysis attacks.
System architecture and parallel cryptographic algorithm
Overview of system architecture
A simplified diagram of the multi-core architecture is shown in Fig. 1. As depicted, the system has two identical cores (CORE0 and CORE1), each with separate local memory (not shown in Fig. 1). The cores communicate with each other through the Shared Memory. When the parallel cryptographic algorithm is executed, the clocks of the cores are asynchronous, since they are produced by the clock generator from the random numbers supplied by the random number generator (RNG). An RNG is already present in secure smart card applications, so it is not additional overhead [7]. The clock generator and the CONTROL UNIT are the only additional modules required for this countermeasure.
Clock switching control
In our proposed architecture, normal programs are executed by the different cores with the default clocks and the RNG is disabled, as shown by the purple line in Fig. 1. Only when the parallel cryptographic algorithm is scheduled are the clocks of the cores switched from the default clocks to asynchronous clocks. In order not to influence the execution of normal programs, the clocks must also be switched back from asynchronous clocks to the default clocks once the cryptographic algorithm has completed. These two processes are called clock switching. In our implementation, clock switching is triggered by two signals (START and END), which are sent to the CONTROL UNIT (CUNIT) at the beginning and end of each parallel cryptographic algorithm. The START signal indicates that the parallel cryptographic algorithm has been scheduled, and the END signal indicates that it has completed.
Here we assume that the original cryptographic algorithm is split into two parts, that is, the parallel cryptographic algorithm consists of two programs named A and B, which are executed by CORE0 and CORE1, respectively. The clocks of CORE0 and CORE1 are called Clock 0 and Clock 1.
When program A is scheduled on CORE0, the START_0 signal is sent to the CUNIT and, at the same time, an external interrupt is raised on CORE1. The clocks of the two cores must now be switched from the default clocks to asynchronous clocks. First, Clock 0 is gated by the CUNIT. After CORE1 has saved the necessary registers on the stack, program B is scheduled on CORE1 and START_1 is sent to the CUNIT, which then also gates Clock 1. Once both clocks are gated, the CUNIT enables the RNG, whose random numbers are used by the clock generator to generate the asynchronous clocks. The CUNIT then enables CORE0 and CORE1, and the parallel cryptographic algorithm is executed simultaneously on the two cores with asynchronous clocks. This flow is shown in Fig. 2(a).
When the parallel cryptographic algorithm has completed, clock switching is triggered by the END signal. Regardless of which of programs A and B completes first, the clock of the core that finishes first is gated. The RNG is disabled and the clocks of the two cores are switched from asynchronous clocks back to the default clocks only after all the programs have completed. Finally, CORE1 restores its registers from the stack and resumes execution, and the next program is scheduled on CORE0. This flow is shown in Fig. 2(b).
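The START/END sequence described above can be sketched as a small behavioural model. This is only an illustration under our own assumptions, not the hardware implementation: the class name `ControlUnit`, the mode strings and the method names are hypothetical, and register saving and restoring on CORE1 is not modelled.

```python
class ControlUnit:
    """Toy model of the CUNIT clock-switching sequence (all names are illustrative)."""

    def __init__(self):
        # Each clock is in one of three modes: "default", "gated" or "async".
        self.clock_mode = {0: "default", 1: "default"}
        self.rng_enabled = False
        self._started = set()
        self._ended = set()

    def start(self, core):
        """START_n: gate the clock of `core`; go asynchronous once both cores are gated."""
        self._started.add(core)
        self.clock_mode[core] = "gated"
        if self._started == {0, 1}:
            self.rng_enabled = True  # RNG feeds the clock generator
            self.clock_mode = {0: "async", 1: "async"}
            self._started.clear()

    def end(self, core):
        """END_n: gate the finished core; restore default clocks once both have finished."""
        self._ended.add(core)
        self.clock_mode[core] = "gated"
        if self._ended == {0, 1}:
            self.rng_enabled = False
            self.clock_mode = {0: "default", 1: "default"}
            self._ended.clear()


cunit = ControlUnit()
cunit.start(0)  # program A scheduled on CORE0
cunit.start(1)  # program B scheduled on CORE1 after the interrupt
print(cunit.clock_mode, cunit.rng_enabled)
cunit.end(1)    # whichever program finishes first has its clock gated
cunit.end(0)
print(cunit.clock_mode, cunit.rng_enabled)
```

The model keeps the RNG enabled until both END signals have arrived, which mirrors the rule that the default clocks are restored only after all programs have completed.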
Parallel cryptographic algorithm
In general, most cryptographic algorithms are built from two main functional modules: key schedule and enciphering computation. With the architecture we propose, a cryptographic algorithm can be split into two parts, each of which is executed by a different core, with both parts running simultaneously. The basic idea of the parallel cryptographic algorithm is therefore that key schedule and enciphering computation are executed by different cores simultaneously with asynchronous clocks. For instance, CORE1 executes the key schedule to generate subkeys and writes each subkey to the Shared Memory, while CORE0 reads each subkey from the Shared Memory and executes the enciphering computation. The proposed architecture is theoretically applicable to all cryptographic algorithms.
The most difficult and important part of the parallel cryptographic algorithm is to make sure that each subkey is written and read in sequence. Therefore, a flag is kept in the Shared Memory and can be changed by both CORE0 and CORE1. The flag is set to logic 1 when CORE1 has written a subkey and to logic 0 when CORE0 has read it. If the flag equals logic 1, CORE1 has written the subkey and CORE0 can read it. Otherwise, CORE0 has read the current subkey and CORE1 can write the next one. As shown in Fig. 3(a), when CORE0 needs a subkey, it first sends a request signal to the Shared Memory, and the grant signal then indicates whether CORE0 has obtained the right to use the Shared Memory. There are three possible paths: Path 1, Path 2 and Path 3. Path 1, shown as the purple dotted line in Fig. 3(a), means that the Shared Memory is busy and CORE0 must wait until it is idle. Path 2, shown as the blue line in Fig. 3(a), means that the Shared Memory is idle but CORE0 cannot read the subkey because CORE1 has not yet written it. In this case, CORE0 gives up the usage right and waits until CORE1 writes the subkey. Path 3, shown as the green dotted line in Fig. 3(a), means that CORE0 reads the subkey and then sets the flag to logic 0. The flow of CORE1 writing a subkey, shown in Fig. 3(b), is analogous to that of CORE0 reading it.
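The flag protocol above can be illustrated in software with two threads standing in for the two cores. This is a minimal sketch under our own assumptions: a `threading.Lock` plays the role of the request/grant arbitration of the Shared Memory, and the names `SharedMemory`, `core1_key_schedule`, `core0_encipher` and `run_handshake` are hypothetical. The subkeys are placeholder integers, not real DES or AES subkeys.

```python
import threading
import time


class SharedMemory:
    """Shared memory with one subkey slot and a flag (1 = subkey written, 0 = subkey read)."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for request/grant arbitration
        self.flag = 0
        self.subkey = None


def core1_key_schedule(mem, subkeys):
    """CORE1: write each subkey only after the previous one has been read."""
    for subkey in subkeys:
        while True:
            with mem.lock:            # blocked here corresponds to Path 1 (memory busy)
                if mem.flag == 0:     # previous subkey consumed: safe to write
                    mem.subkey = subkey
                    mem.flag = 1
                    break
            time.sleep(0)             # give up the usage right and retry


def core0_encipher(mem, n_subkeys, received):
    """CORE0: read each subkey only after CORE1 has written it."""
    for _ in range(n_subkeys):
        while True:
            with mem.lock:            # blocked here corresponds to Path 1 (memory busy)
                if mem.flag == 1:     # Path 3: read the subkey, then clear the flag
                    received.append(mem.subkey)
                    mem.flag = 0
                    break
            time.sleep(0)             # Path 2: subkey not written yet, release and wait


def run_handshake(subkeys):
    """Run both cores concurrently and return the subkeys in the order CORE0 read them."""
    mem = SharedMemory()
    received = []
    t1 = threading.Thread(target=core1_key_schedule, args=(mem, subkeys))
    t0 = threading.Thread(target=core0_encipher, args=(mem, len(subkeys), received))
    t0.start()
    t1.start()
    t0.join()
    t1.join()
    return received


print(run_handshake(list(range(16))))
```

Whatever the relative speed of the two threads, CORE0 receives the subkeys in the order CORE1 produced them, which is the property the flag is meant to guarantee.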
Experimental setup
We implement the multi-core system with two ARM Cortex-M0 processor cores. The system has been verified in Verilog HDL and evaluated with a 40 nm CMOS standard cell library. As depicted in Fig. 4, the simulation platform consists mainly of a gate-level simulation module, a power simulation module, a power value extraction module and a power analysis attack module. In the gate-level simulation module, the Synopsys VCS tool simulates the synthesized design using the gate-level netlist, random input plaintexts, the standard delay format (SDF) file and other necessary vectors. The output of the gate-level simulation module is converted to value change dump (VCD) format and is then used in the power simulation module, where the power simulation tool PTPX computes the power consumption in time-based mode. The power traces are extracted by Python scripts and imported into MATLAB. The power analysis attack module is implemented as a MATLAB program. As mentioned above, our proposed architecture is theoretically applicable to all cryptographic algorithms. To verify the feasibility and security of the proposed architecture, two well-known cryptographic algorithms, DES and AES, are chosen as benchmarks. Each of them is split into two portions, and each portion is executed by a different core simultaneously with asynchronous clocks. In our experiments, the clock generator is implemented as a configurable ring oscillator with a programmable delay line [8]. The clock frequencies used in the experiments range from 1 MHz to 80 MHz.
Results
In our experiments, DES and AES are selected as benchmarks. The split versions of DES and AES are called PDES and PAES. DPA and CPA are performed on the original DES/AES with the single-core architecture and on PDES/PAES with the multi-core architecture. The single-core architecture is derived from the multi-core architecture: CORE0 executes the original cryptographic algorithm while CORE1 is powered off. DES and AES each contain an important nonlinear selection function, called S-Box in DES and SubByte in AES. The attack point for DPA and CPA is an output bit of the first S-Box/SubByte in the first round of the DES/AES algorithm. In general, the least significant bit (LSB) of the output is chosen. Hence, only the power consumption of the first round of DES/AES is recorded. To record the power consumption, a trigger signal is set at the beginning and end of the first round of the cryptographic algorithm in the implementation. Therefore, all power traces have the same static start point and end point.
DPA/CPA on DES and PDES
Fig. 5(a) shows the DPA values on DES with the single-core architecture for all 64 (2^6) possible keys at the attack point. As shown in Fig. 5(a), the correct key (0d60) can be easily distinguished, since a significant peak appears at key 0d60. Only approximately 250 power traces are needed to recover the correct key. Fig. 5(b) shows the correlation values of the CPA attack for all 64 key hypotheses. In the CPA attack, the hypothetical power consumption matrix is calculated using the Hamming Weight (HW) leakage model. In Fig. 5(b), a significant peak again appears at the correct key. Fig. 5(a) and Fig. 5(b) thus show that DPA and CPA can reveal the correct key on DES with the single-core architecture. Fig. 5(c) shows the DPA values on PDES with the multi-core architecture for all 64 possible keys at the attack point. As depicted in Fig. 5(c), DPA cannot reveal the correct key on PDES: the correct key (0d60) does not produce a significant peak. Fig. 5(d) shows the correlation values of the CPA attack for all 64 key hypotheses at the attack point on PDES with the multi-core architecture. As depicted in Fig. 5(d), the correlation value of the correct key is smaller than that of the wrong keys. Fig. 5(c) and Fig. 5(d) thus show that DPA and CPA cannot reveal the correct key on PDES with the multi-core architecture.
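For readers unfamiliar with CPA, the attack with a Hamming-weight leakage model can be sketched on synthetic data. This is not the MATLAB attack module used in our experiments. It is a toy example under stated assumptions: a 4-bit S-box (the one used in the PRESENT cipher) replaces the DES S-Box, each trace is reduced to a single perfectly aligned sample, and the leakage is simulated as Hamming weight plus Gaussian noise. The function names and parameters are hypothetical.

```python
import random

# 4-bit S-box of the PRESENT cipher, used here only as a small stand-in (assumption).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD, 0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
HW = [bin(v).count("1") for v in range(16)]  # Hamming weight lookup table


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_traces(key, n_traces, noise_std, rng):
    """Aligned single-sample traces: HW of the S-box output plus Gaussian noise."""
    plaintexts = [rng.randrange(16) for _ in range(n_traces)]
    traces = [HW[SBOX[p ^ key]] + rng.gauss(0.0, noise_std) for p in plaintexts]
    return plaintexts, traces


def cpa_attack(plaintexts, traces):
    """Correlation between measured traces and the HW hypothesis for every key guess."""
    return [
        pearson([HW[SBOX[p ^ guess]] for p in plaintexts], traces)
        for guess in range(16)
    ]


def recover_key(true_key, n_traces=2000, noise_std=0.5, seed=0):
    """Simulate traces for `true_key` and return (best guess, correlation per guess)."""
    rng = random.Random(seed)
    plaintexts, traces = simulate_traces(true_key, n_traces, noise_std, rng)
    corr = cpa_attack(plaintexts, traces)
    return corr.index(max(corr)), corr


guess, corr = recover_key(0xA)
print(guess, round(corr[guess], 2))
```

The sketch works only because every simulated trace samples the same operation at the same time index. This is exactly the alignment assumption that the proposed architecture is designed to break.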

DPA/CPA on AES and PAES
The DPA values on AES with the single-core architecture for all 256 possible keys at the attack point are shown in Fig. 6(a). The correct key (0d43) is still predicted. Another peak in our curve is clearly apparent at a key value of 224. Reasons for such ghost peaks are analyzed in [2]. Fig. 6(b) shows the correlation values of the CPA attack for all 256 key hypotheses. A peak with a correlation value of nearly 0.72 appears at the correct key (0d43). Fig. 6(a) and Fig. 6(b) thus show that the DPA and CPA attacks successfully obtain the correct key. Fig. 6(c) and Fig. 6(d) show the DPA values and correlation values on PAES with the multi-core architecture for all 256 possible keys at the attack point. In Fig. 6(c) and Fig. 6(d), the DPA value and correlation value of the correct key are smaller than those of the wrong keys. This means that both DPA and CPA fail to attack PAES with the multi-core architecture.
Security analysis
To demonstrate the security of our solution, we compare the execution of DES and PDES with two different plaintexts (plaintext0 and plaintext1). The two plaintexts are randomly chosen, and the power consumption of the first iteration of DES is extracted. Fig. 7(a) and Fig. 7(b) show the execution of DES on a single core at a fixed clock with the two plaintexts. Because the clock of the core and the operation (first iteration of the DES algorithm) are fixed, the two power traces of DES on a single core are aligned. Power trace misalignment, for example by means of a randomized clock [9], is another common countermeasure. However, since power trace misalignment methods do not change the power consumption behavior of the core (or circuit or algorithm), but only misalign the power traces, an alignment preprocessing technique [10] can be used to defeat them. Fig. 7(c) and Fig. 7(d) show the power traces of DES running on a single core with different clocks. These two power traces have been processed by the alignment technique. In Fig. 7(c) and Fig. 7(d), the two power traces are aligned very well, and they differ only in amplitude. This is because the operation (first iteration of the DES algorithm) is fixed, and the clock frequency only affects the power value. Therefore, power trace misalignment methods are not an effective countermeasure against DPA and CPA. The executions of PDES with the two plaintexts are shown in Fig. 7(e) and Fig. 7(f). These two power traces have also been processed by the alignment technique, yet they remain misaligned. The operations of the first round of PDES for three asynchronous clock pairs are shown in Fig. 8. This figure shows that the sequence of operations in the first round of PDES varies with the clock pair.
That is to say, the power consumption behavior of the algorithm itself varies under the proposed architecture. This is because the original DES algorithm is divided into two parts, each of which is executed by one core with asynchronous clocks. Therefore, because the operation sequence of the parallel cryptographic algorithm varies with the random clocks, the parallel cryptographic algorithm on the multi-core architecture is robust against alignment preprocessing techniques.
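The effect of the clock pair on the combined operation sequence can be illustrated with a deliberately simplified timing model. The sketch below assumes that each core completes one operation per clock period and ignores stalls caused by the subkey handshake. The operation labels (`A0`, `B0`, ...) and the function names are hypothetical. Only the 1 MHz to 80 MHz frequency range is taken from our experimental setup.

```python
import random


def interleave(period0, period1, n_ops):
    """Order in which operations of program A (CORE0) and B (CORE1) complete.

    Each core is assumed to finish one operation per clock period, so
    operation k of a core completes at time (k + 1) * period.
    """
    events = [((k + 1) * period0, f"A{k}") for k in range(n_ops)]
    events += [((k + 1) * period1, f"B{k}") for k in range(n_ops)]
    return [name for _, name in sorted(events)]


def random_clock_pair(rng, f_min_mhz=1.0, f_max_mhz=80.0):
    """Draw two independent clock frequencies (in MHz) within the given range."""
    return rng.uniform(f_min_mhz, f_max_mhz), rng.uniform(f_min_mhz, f_max_mhz)


rng = random.Random(1)
for _ in range(3):
    f0, f1 = random_clock_pair(rng)
    # The period is the reciprocal of the frequency; units cancel in the ordering.
    print(f"{f0:5.1f} MHz / {f1:5.1f} MHz ->", interleave(1.0 / f0, 1.0 / f1, 4))
```

With a single core and a randomized clock, the operation order stays the same and only the time axis is stretched, which alignment preprocessing can undo. With two independently clocked cores, the interleaving of the two operation streams changes, so the summed power trace no longer corresponds to one fixed sequence.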
Compared with hardware implementations
In general, there are two kinds of hardware implementation of cryptographic algorithms: ASIC and FPGA. In an ASIC implementation, each cryptographic algorithm needs its own dedicated hardware, which is pure area overhead whenever no encryption operation is being executed. Moreover, the need to support multiple cryptographic algorithms on one chip is increasing, and the area overhead grows with it. Another drawback of the ASIC implementation is its lack of flexibility, since the cryptographic algorithm is fixed and cannot be modified after manufacturing. Finally, the ASIC implementation carries a high financial risk. The FPGA implementation is flexible, but any change in the cryptographic algorithm requires not only a new HDL (Hardware Description Language) description but also a new data interface for the corresponding algorithm [11]. A further drawback is that there are some special attacks on FPGA implementations of cryptographic algorithms [12]. The proposed architecture, based on hardware-software co-design, is in theory a universal and flexible architecture for all cryptographic algorithms. The parallel cryptographic algorithm is executed only when a cryptographic operation is needed. At all other times, the cores can execute other tasks. The proposed architecture therefore has only a small area overhead. A comparison of the ASIC implementation, the FPGA implementation and the proposed architecture is shown in Table I.
Conclusion
In this paper, we propose a multi-core architecture with asynchronous clocks to prevent power analysis attacks. The proposed architecture is theoretically applicable to all cryptographic algorithms. DES and AES are chosen as benchmarks to test the effectiveness and security of the proposed architecture. DPA and CPA are performed on PDES/PAES with the proposed architecture, and the results show that neither attack reveals the correct key.
Acknowledgments
