In this paper, we propose a new design methodology to assess the risk of side-channel attacks, more specifically timing analysis and simple power analysis, at an early design stage. The method is illustrated with the design of an elliptic curve cryptographic processor. It also allows the designer to evaluate the quality of countermeasures against these attacks by evaluating Hamming distances for each signal and each register in a partial functional domain (e.g. datapath or controller). Thus a first-order side-channel-resistant design can be obtained at the system level, where simulation runs faster than conventional HDL simulation.
INTRODUCTION
Secure systems have to be resistant against various malicious attacks in order to prevent leakage of information or unintended use of the system. For silicon devices, a recent threat is the class of Side-Channel Attacks (SCAs), a collection of attacks that non-intrusively observe the computation time, power variations, or electromagnetic radiation of the device. By simple observation or mathematical processing of the observed physical phenomena, one can retrieve secret data from the device.
In embedded systems such as mobile phones, smartcards and RFIDs, a high-performance cryptographic function is required at low cost. It is especially challenging to implement Public-Key Cryptography (PKC) such as RSA and Elliptic Curve Cryptography (ECC). With limited silicon resources and a limited power budget, the short key lengths of ECC (proposed by Koblitz [1] and Miller [2]) are often preferred over the more traditional RSA-based systems. From an SCA point of view, ECC needs to be resistant to SCAs to protect private information (i.e. a secret key in memory). Conventional hardware design flows only explore the tradeoff among area, power consumption and performance of a target device. Our contribution is to take SCAs into account and integrate them into a system-level design flow at an early design stage, before VHDL or Verilog design starts.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI'07, March 11-13, 2007, Stresa-Lago Maggiore, Italy. Copyright 2007 ACM 978-1-59593-605-9/07/0003 ...$5.00.
Attacks on Implementations
Several classes of attacks can be distinguished. The most straightforward is a physical attack applied directly to the silicon. This is a powerful method if a probing point is available. For instance, it is easy to retrieve secret information by probing the data on a bus. The fault induction attack [4] is a well-known technique that works by disturbing the device to induce errors in the computation. Because they interfere with the device, these attacks are called active attacks.
A possibly more dangerous type of attack, undetectable by the embedded system, is the passive attack. These attacks are based on measuring physical characteristics that leak through side channels of the embedded system. Timing Analysis (TA) observes the computation time. If the execution time varies with the data or the key used in the computations, this can be detected by the attacker [5]. Simple Power Analysis (SPA) measures the power consumption during cryptographic operations and infers the types of computations being performed. In [6], Kocher et al. introduced Differential Power Analysis (DPA), which also considers effects correlated to data values. Electromagnetic Analysis (EMA) and Acoustic Analysis (AA) have also been introduced as effective SCAs [7].
Our Design approach
Countermeasures against SCAs need to be taken into account at all abstraction levels. This is illustrated for ECC in Fig. 1. A similar figure can be made for RSA or Hyper Elliptic Curve Cryptography (HECC). We chose ECC because it is composed of multiple layers of operations (scalar multiplication, point addition/doubling, GF(p) operations), each of which maps onto a datapath and a controller in HW. Possible SCAs are shown on the left side of the pyramid. Each HW module can potentially be attacked according to its functionality.
At the cycle-true Register Transfer Level (RTL), TA and SPA attacks (which we call simple SCAs here) can be detected and countermeasures can be implemented. Of course, once a design has been made resistant to simple SCAs, the differential and higher-order ones need to be addressed. They require solutions such as adding noise, introducing random delays or using a special logic style [8].
Paper Overview
The remainder of this paper is organized as follows. Section 2 explains the secure system-level design flow proposed in the paper. Section 3 briefly presents the background for ECC over a prime field. Section 4 presents experimental results for an ECC implementation together with a detailed explanation of the design flow. Section 5 concludes the paper.
A SECURE DESIGN FLOW
In this section, we look at a HW design approach for system design with consistent resistance against SCAs. For simple SCAs, the main sources of side-channel leakage are conditional operations and condition-dependent signals. HW can be made to execute with constant timing and thus shows better resistance against simple SCAs than SW. However, condition-dependent signals still exist in a HW design and leave it susceptible to simple SCAs. In our design flow, we verify the SCA-resistance at the system-level design stage. This way, SCA-resistance can be taken into account jointly with cost and performance optimizations at an early design stage.
System-level design Tool
We use a simulation environment called GEZEL [9], which allows us to estimate instantaneous dynamic power consumption. We use the toggle count per clock cycle (TCPC) as an approximation of this power. The toggle count is obtained directly from the RTL model and does not require synthesis of the model. Despite this approximation, our experiments have shown that it is sufficient for building cryptosystems that are resistant to simple SCAs. As mentioned before, higher-order attacks, such as DPA, need to be addressed in an actual implementation. However, system-level design, the main focus of this paper, gives the designer an environment for a quick and correct evaluation of first-order attacks. It does not make sense to address higher-order attacks (which are much more difficult and time-consuming to mount) while simple first-order attacks are still possible. Such an environment is especially needed for cryptographic algorithms that contain multiple levels of complex arithmetic, as is the case for ECC and HECC. Because of the size and complexity of the design, these SCA-resistance tests are very time-consuming at gate level and impossible at SPICE level.
The GEZEL design environment allows us to try out different alternatives for performance, area and resistance to SCAs at a cycle-accurate level. For each alternative, functional tests and simple SCA-resistance tests are performed as illustrated in Fig. 2. We obtain the toggle count of the HW modules as a function of the clock cycle in order to see the power pattern. These toggle counts allow us to analyze the risk of simple SCAs (especially SPA) and, if a risk is detected, to rewrite the HW module. After both tests pass in cycle-accurate simulation, the HW model is converted into VHDL and synthesized by a secure back-end.
RT-level toggle counting
The HW descriptions in GEZEL are expressions of cycle-accurate register transfers, containing operations and assignments on signals and registers. For example, consider an RTL fragment in which a signal a is assigned the constant 3 and a register b is assigned the sum a + b, with all values initially 0. This piece of code contains three operations: two assignments and an addition. In the first clock cycle, signal a changes from 0 to 3. The output of the addition will be 3 as well, and this value will be assigned to register b. Each of these three transitions from 0 to 3 flips two bits, so the total toggle count for the first clock cycle equals 6. In the second clock cycle, signal a does not change value. The output of the addition, however, changes from 3 to 6 (the Hamming distance between '011' and '110' is 2), and register b changes from 3 to 6 as well. The TCPC for the second clock cycle thus equals 4. We can continue in this way to obtain an approximation of the instantaneous dynamic power consumption. This methodology is very simple and, for example, takes neither glitching nor the implementation complexity of operators into account. For the purpose of analyzing simple SCAs, on the other hand, it is adequate.
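The toggle counting in this example can be reproduced with a short Python sketch. This is not GEZEL code: it assumes the fragment assigns the constant 3 to signal a and the sum a + b to register b, with all values reset to 0, and the names `hamming_distance` and `simulate_tcpc` are illustrative only.

```python
def hamming_distance(old: int, new: int) -> int:
    """Number of bit positions that differ between two values."""
    return bin(old ^ new).count("1")


def simulate_tcpc(cycles: int, a_value: int = 3) -> list[int]:
    """Toggle count per clock cycle for the assumed fragment: a = a_value; b = a + b."""
    a = add_out = b = 0  # assumed reset state
    trace = []
    for _ in range(cycles):
        new_a = a_value          # assignment to signal a
        new_add = new_a + b      # adder output uses the current register value
        new_b = new_add          # value assigned to register b
        trace.append(
            hamming_distance(a, new_a)
            + hamming_distance(add_out, new_add)
            + hamming_distance(b, new_b)
        )
        a, add_out, b = new_a, new_add, new_b
    return trace


print(simulate_tcpc(2))  # [6, 4]
```

The first two entries match the counts derived in the text (6 and 4). Like the TCPC itself, the sketch ignores glitching and operator complexity.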
ECC OVER A PRIME FIELD
The basic operation in ECC is scalar multiplication. One scalar multiplication on an elliptic curve over GF(p) consists of multiple point additions and doublings. Each point addition or doubling is executed by a sequence of operations in the underlying field, GF(p). The performance of a PKC is primarily determined by the efficiency of the arithmetic operations in the underlying finite field. Therefore, system architectures for ECC are normally designed to accelerate the field operations. We assume here that we have an operational unit available to perform modular arithmetic, the MALU (Modular Arithmetic Logic Unit). This unit is implemented with a constant execution time to provide resistance to simple SCAs.
Point addition and doubling can be performed according to the algorithms given in [10], for instance. They consist of a series of modular multiplications and additions. Normally, modular addition can be executed much faster than modular multiplication in HW. Therefore, if point doubling and point addition have different sequences of modular multiplications and additions, they can be easily distinguished. These are the so-called imbalanced point operations.
Scalar multiplication can be implemented as a repeated combination of point additions and doublings. The simplest algorithm is shown in Algorithm 1. It is vulnerable to simple SCAs if step 4 in Algorithm 1 causes stalling cycles. One countermeasure was proposed by Coron [11]. The idea is to ensure SCA-resistance by computing both a point addition and a point doubling in every iteration of the for-loop. However, this method decreases performance because of the additional dummy point additions. There are also other countermeasures at the algorithm level [12], [13].
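To make the difference between the two approaches concrete, the following Python sketch records the sequence of point operations each one performs. It rests on assumptions: Algorithm 1 is taken to be the left-to-right binary (double-and-add) method, integers under ordinary addition stand in for curve points, and the function names are hypothetical.

```python
def double_and_add(k: int, p: int) -> tuple[int, str]:
    """Left-to-right binary method: the addition is executed only for 1 bits."""
    if k < 1:
        raise ValueError("k must be positive")
    q, ops = p, []
    for bit in bin(k)[3:]:      # skip the '0b' prefix and the leading 1 bit
        q = q + q               # stand-in for point doubling
        ops.append("D")
        if bit == "1":          # key-dependent branch
            q = q + p           # stand-in for point addition
            ops.append("A")
    return q, "".join(ops)


def double_and_add_always(k: int, p: int) -> tuple[int, str]:
    """Variant in the spirit of Coron's countermeasure: one addition per key bit."""
    if k < 1:
        raise ValueError("k must be positive")
    q, ops = p, []
    for bit in bin(k)[3:]:
        q = q + q
        ops.append("D")
        t = q + p               # always computed; a dummy when the bit is 0
        ops.append("A")
        q = t if bit == "1" else q
    return q, "".join(ops)


print(double_and_add(11, 1))         # (11, 'DDADA')
print(double_and_add_always(11, 1))  # (11, 'DADADA')
```

In the first trace the positions of the additions reveal the key bits, while the second trace has the same shape for every key of the same length, at the cost of the dummy additions.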
EXPERIMENTAL RESULTS
Here we discuss the method for verifying simple SCA-resistance for the ECC-160p implementation.
Countermeasures for TA
Our proposed countermeasures for TA are as follows. As mentioned previously, modular multiplications and additions are the basic operations of ECC. We create a countermeasure against SCAs by unifying the multiplication and addition instructions, i.e. we have a unified modular multiplication-and-add instruction that completes either operation in the same number of cycles. The MALU supports the (XY ± S) mod p operation in constant execution time. In addition, a dedicated HW controller for Algorithm 1 is implemented. The stalling cycles can be concealed by evaluating the branch conditions beforehand. The dedicated controller also uses a queue buffer to ensure that instructions for the MALU are dispatched constantly. This micro-coded controller is a secure controller. Hence, the imbalance of Algorithm 1 can be resolved without extra dummy instructions.
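A purely functional view of the unified instruction can be sketched in Python as follows. The model only captures the arithmetic and the fact that the cycle count does not depend on the operands; the function name `malu`, the `subtract` flag, and the way plain multiplication and addition are mapped onto the unified operation are assumptions for illustration, not the actual instruction encoding.

```python
MALU_CYCLES = 93  # duration of one MALU operation reported for this design


def malu(x: int, y: int, s: int, p: int, subtract: bool = False) -> tuple[int, int]:
    """Return ((x*y + s) mod p or (x*y - s) mod p, cycle count).

    The cycle count is a constant, independent of operands and operation.
    """
    signed_s = -s if subtract else s
    return (x * y + signed_s) % p, MALU_CYCLES


# Assumed mappings onto the unified operation (illustrative only):
print(malu(3, 4, 0, 7))  # multiplication: 3*4 mod 7     -> (5, 93)
print(malu(2, 1, 6, 7))  # addition:       2*1 + 6 mod 7 -> (1, 93)
```

Because every call reports the same cycle count, an observer measuring only execution time cannot tell a multiplication from an addition in this model.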
TCPC after TA-resistance
We collect toggle counts for every register in the whole design by running the GEZEL simulation. We first observed the toggle count of the registers, as shown in Fig. 3, after verifying only the functionality. More precisely, we evaluate the TCPC for the whole design, since the total Hamming distance is correlated with the power consumption of the implemented LSI. The trace of the GEZEL simulation shows peaks at constant intervals thanks to the constant execution time of the MALU. The period corresponds precisely to the duration of one MALU operation (93 cycles). We can claim that TA-resistance is implemented as intended. In other words, TA-resistance can be checked with cycle-true functional simulations. However, if we look at the height of the peaks, some peaks are noticeably higher than others (e.g. the fourth and fifth MALU operations in point addition). They are always located at the same position in point doubling and addition. Therefore, this design has bugs in terms of SPA-resistance, although its functionality and TA-resistance are correctly implemented.
Then, in order to identify the cause of the "big peaks", we evaluate the partial TCPC. First we separate the toggle counts into two parts: one obtained from the MALU and the other from the controller. From the results shown in Fig. 4, we see that the "big peaks" are mainly caused by the MALU. Moreover, a potential risk of SPA in the controller becomes apparent because the periodic pattern of the peaks in its trace corresponds to the point operations. Thus, by partial simulation of the TCPC, we can find hidden security bugs that cannot be found by conventional functional simulation. In the following sections, we focus on explaining how to debug the former security bug with partial toggle counts.
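The visual check of peak heights can also be approximated programmatically. The Python sketch below splits a TCPC trace into windows of one MALU operation each and flags windows whose peak clearly exceeds the typical peak. The function names and the 20% tolerance are hypothetical choices, and the short trace in the usage lines is synthetic, not measured data.

```python
from statistics import median


def window_peaks(tcpc: list[int], period: int) -> list[int]:
    """Highest toggle count within each complete window of `period` cycles."""
    return [
        max(tcpc[start:start + period])
        for start in range(0, len(tcpc) - period + 1, period)
    ]


def outlier_windows(tcpc: list[int], period: int, tolerance: float = 0.2) -> list[int]:
    """Indices of windows whose peak exceeds the median peak by more than `tolerance`."""
    peaks = window_peaks(tcpc, period)
    reference = median(peaks)
    return [i for i, peak in enumerate(peaks) if peak > reference * (1 + tolerance)]


# Synthetic trace with a period of 4 cycles; the third window has a "big peak".
trace = [1, 5, 1, 1,  1, 5, 1, 1,  1, 9, 1, 1,  1, 5, 1, 1]
print(window_peaks(trace, 4))     # [5, 5, 9, 5]
print(outlier_windows(trace, 4))  # [2]
```

For the design discussed here the window length would be 93 cycles. A flagged window indicates an operation whose power signature stands out and therefore deserves a closer look with partial toggle counts.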
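Separating the toggle counts by module amounts to summing per-register traces within each functional domain. The following Python sketch shows one way to do this, assuming the simulator output has already been collected into a mapping from register name to per-cycle toggle counts. The register name `PC`, the module labels, and the numbers are hypothetical.

```python
def partial_tcpc(
    register_toggles: dict[str, list[int]],
    module_of: dict[str, str],
) -> dict[str, list[int]]:
    """Sum per-register toggle traces into one TCPC trace per module."""
    partial: dict[str, list[int]] = {}
    for register, toggles in register_toggles.items():
        module = module_of[register]
        if module not in partial:
            partial[module] = [0] * len(toggles)
        partial[module] = [acc + t for acc, t in zip(partial[module], toggles)]
    return partial


# Illustrative data only: two MALU registers and one controller register.
register_toggles = {
    "REG_S": [4, 0, 4],
    "REG_R": [1, 1, 1],
    "PC":    [2, 2, 2],
}
module_of = {"REG_S": "malu", "REG_R": "malu", "PC": "controller"}
print(partial_tcpc(register_toggles, module_of))
# {'malu': [5, 1, 5], 'controller': [2, 2, 2]}
```

The same grouping can be refined down to single registers, which is how an irregular register can be isolated once the module responsible for a peak is known.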
Security Bug in the MALU
The MALU is composed of several registers and the datapath for computing (XY ± S) mod p. All registers are cleared to 0 before an operation. The registers provide the input data for the datapath (REG_X, REG_Y, REG_S), store the intermediate values (REG_C0, REG_C1), and collect the result (REG_R). The toggle simulation result for each register in the MALU is shown in Fig. 5(a). From the trace, we can see that the TCPCs for REG_Y and REG_S are irregular. In particular, the irregular TCPC for REG_S affects the height of the peaks and eventually leads to the "big peaks" in the first simulation. The reason for the irregularity is as follows. When setting the input values for each modular multiplication-and-addition, the MALU has three different procedures depending on the type of operation:
S̄ denotes the bitwise inversion of S. It is used for subtraction in two's complement representation. Assuming that each bit of S is 1 with probability 0.5, the Hamming distance when setting REG_S can be regarded as the same whether S or S̄ is loaded. However, no toggle occurs in REG_S in operation 3). This observation agrees completely with the simulation result.
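The probabilistic argument can be checked by exhaustive enumeration for a small register width. The Python sketch below assumes that the three procedures load S, load the bitwise inverse of S, or leave REG_S at its cleared value; this mapping is inferred from the surrounding text, and the 8-bit width and function names are illustrative.

```python
def popcount(value: int) -> int:
    """Number of 1 bits; equals the Hamming distance from an all-zero register."""
    return bin(value).count("1")


def mean_load_toggles(width: int, load) -> float:
    """Average toggles when a cleared register receives load(s), over all s."""
    mask = (1 << width) - 1
    total = sum(popcount(load(s) & mask) for s in range(1 << width))
    return total / (1 << width)


WIDTH = 8  # small illustrative width so that all values can be enumerated
print(mean_load_toggles(WIDTH, lambda s: s))   # load S           -> 4.0
print(mean_load_toggles(WIDTH, lambda s: ~s))  # load inverted S  -> 4.0
print(mean_load_toggles(WIDTH, lambda s: 0))   # register stays 0 -> 0.0
```

Loading S or its inverse toggles half of the bits on average, whereas the third case contributes no toggles at all, which is the imbalance visible in the trace.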
Fixing the Security Bug
The simplest way to fix the bug is to add a register that compensates for the missing toggles of REG_S. Fortunately, the output register, REG_R, is not used while REG_S is used by the datapath. Therefore, we decided to reuse REG_R to fix this bug. In this way, we can add SCA-resistance to the design without area or performance penalties. The simulation result after fixing the bug is shown in Fig. 5(b). The subtotal of the TCPC for REG_S, REG_R and REG_Y shows good regularity in the trace. As a result, Fig. 6 shows the TCPC trace corresponding to Fig. 3. It is hard to distinguish point operations from the trace, i.e. a simple-SCA-resistant design is successfully implemented with our proposed design method.
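The effect of such a compensation on the toggle subtotal can be illustrated with a small Python sketch. It is only one possible reading of the fix: it assumes that a dummy value is loaded into REG_R whenever REG_S stays cleared, and the operation labels `add`, `sub` and `mul`, the 8-bit width, and the operand values are hypothetical.

```python
def popcount(value: int) -> int:
    """Number of 1 bits; equals the Hamming distance from an all-zero register."""
    return bin(value).count("1")


def load_toggles(op: str, s: int, dummy: int, width: int = 8,
                 compensate: bool = False) -> int:
    """Toggles of REG_S plus REG_R when operands are loaded into cleared registers."""
    mask = (1 << width) - 1
    if op == "add":        # assumed: REG_S receives S
        reg_s, reg_r = s & mask, 0
    elif op == "sub":      # assumed: REG_S receives the bitwise inverse of S
        reg_s, reg_r = ~s & mask, 0
    elif op == "mul":      # assumed: REG_S stays cleared
        reg_s = 0
        reg_r = (dummy & mask) if compensate else 0  # dummy load into REG_R
    else:
        raise ValueError(f"unknown operation: {op}")
    return popcount(reg_s) + popcount(reg_r)


ops = ["add", "sub", "mul"]
s, dummy = 0b10110100, 0b11001010  # illustrative operands with four 1 bits each
print([load_toggles(op, s, dummy) for op in ops])                   # [4, 4, 0]
print([load_toggles(op, s, dummy, compensate=True) for op in ops])  # [4, 4, 4]
```

With the dummy load, the subtotal no longer drops for the third operation type, which mirrors the regular subtotal described for Fig. 5(b).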
CONCLUSIONS
This paper presents a simple-SCA-resistant design flow at the system level. In our experiment, evaluating partial toggle counts proved effective for identifying potential security bugs. SPA-resistant design needs additional verification steps, whereas TA-resistance can be checked in functional simulation.
