Abstract. Side channel cryptanalysis techniques, such as the analysis of instantaneous power consumption, have been extremely effective in attacking implementations on simple hardware platforms. There are several proposed solutions to resist these attacks, most of which are ad hoc and can easily be rendered ineffective. A scientific approach is to create a model for the physical characteristics of the device, and then design implementations provably secure in that model, i.e., they resist generic attacks with an a priori bound on the number of experiments. We propose an abstract model which approximates power consumption in most devices and in particular small single-chip devices. Using this, we propose a generic technique to create provably resistant implementations for devices where the power model has reasonable properties, and a source of randomness exists. We prove a lower bound on the number of experiments required to mount statistical attacks on devices whose physical characteristics satisfy reasonable properties.
Introduction
Side channel cryptanalysis, i.e., cryptanalysis using information leaked during the computation of cryptographic primitives, has successfully been used to break implementations on simple platforms such as chip cards [1, 2, 8]. It has been claimed [2] that in chip-card-like devices, all straightforward implementations are susceptible to attack by power analysis techniques. Analogous to timing attacks [1], simple power attacks, where the adversary extracts key bits by identifying the execution sequence from the instantaneous power consumption, are easier to protect against by making the execution sequence independent of the key bits. Differential and higher-order differential power attacks, on the other hand, are extremely powerful and difficult to protect against. These attacks rely on the ability of the attacker to create two different statistical distributions on the values being manipulated during a single instruction or a set of instructions, based on known input/output and guesses of a few key bits. If these distributions can be distinguished, using statistical tests on instantaneous power samples or any other side channel, then the attacker can verify the key guesses.
Due to the wide-ranging impact of these attacks, there have been several proposed commercial implementations which claim to resist these and similar attacks. Without rigorous justification, several of these solutions are ad hoc and based on simplistic techniques such as probabilistically reordering execution paths. It is our view that such approaches essentially miss the import of these attacks and their underlying basis. Furthermore, these simple countermeasures can be nullified by signal processing. Instead, we propose that the focus should be on sound scientific approaches to this problem, i.e., develop an accurate and abstract model of the problem and identify rigorously proven techniques which can be used as effective countermeasures. A realistic goal for a sound and effective countermeasure is to provably resist all generic attacks by an adversary who is allowed to perform and observe at most an a priori fixed number of adaptively chosen operations. Generic attacks are those that use functional specifications and generic physical characteristics and do not depend on specific implementations and devices. Typically, simple single-chip devices and the keys within them are short-lived, and the number of key-dependent operations performed by them over their lifetime is also limited. A countermeasure secure against an a priori bound on the number of experiments can be used in practice by explicitly enforcing this upper bound in the devices themselves.
In this paper, we propose a general, simplified model for the power consumption in simple devices, and use this to restate the basis of these attacks and to analyze countermeasures. We examine ad hoc approaches which have been proposed and discuss why they are easy to defeat. A general technique is then proposed as a countermeasure against statistical attacks in devices where the power model is reasonable and a source of randomness is available. This technique is based on well-known secret sharing schemes, where each bit of the original computation is divided probabilistically into shares such that any proper subset of shares is statistically independent of the bit being encoded and thus yields no information about the bit. Computation of cryptographic primitives is done accessing only the random shares at each point, with intermediate steps computing only the shares of the result. Splitting the bit into multiple probabilistic shares amplifies the uncertainty of the adversary at each point and forces him to work with the joint distributions of the signal at the points where the shares are being accessed. For computation of common cryptographic primitives, simple sharing schemes based on XOR and addition modulo $2^8$ can be used. We make realistic assumptions about the power consumption model for devices with respect to the uncertainty of the adversary at each point, and analyze the efficacy of this technique to withstand power analysis attacks. Using this, we rigorously prove lower bounds on the number of observations required to statistically distinguish distributions, defined in power attacks, using observations on power samples. Our lower bounds are exponential in the number of shares by which each bit or byte of the computation is encoded. The models and lower bounds are initial steps in developing a formal framework for the problem of computation in the presence of the information leaked due to the observations on the physical characteristics of devices.
Only solutions which can be proved secure in a formalized model should be considered for implementation. Substantial effort is still required to find more appropriate models and stronger analysis.
Section 2 describes a formalization of a general power consumption model. Section 2.1 restates power analysis attacks in this framework.
In Section 3, we analyze countermeasures, examine simple ad hoc solutions which are ineffective, and propose the secret sharing scheme as a general countermeasure against these attacks. In Section 3.4, we make realistic approximations on the model and rigorously prove lower bounds on the number of samples needed to mount differential power attacks against implementations with this countermeasure.
2 Power Model and Definitions
CMOS devices consume power only when changes occur in logic states, while no significant power is needed to maintain a state. Examples of changes include changes in the contents of the RAM, internal registers, bus lines, states of gates and transistors, etc. In simple chips almost all activity is triggered by an internal/external clock edge and all activity ceases well before the next clock edge. A few processes, such as on-chip noise generators, operate independently of the clock and consume a small, possibly random amount of power continuously. Each clock edge triggers a sequence of power-consuming events within the chip, as dictated by the microcode, bringing it to the next state. This sequence depends on parts of the current state of the processor and parts of the state of other subsystems accessed in that cycle. We define relevant state bits as the bits of the overall state which determine the sequence of events, and hence the power, during a clock cycle. Depending on the cycle, the relevant state bits could include bits of internal registers, bits on internal and external buses, address bits, contents of memory locations being accessed, etc. The instantaneous power consumption of the chip shortly after a clock edge is a combination of the consumption components from each of the events that have occurred since the clock edge. Each event's timing and power consumption depends on physical and environmental factors such as the electrical properties of the chip substrate, layout, temperature, voltage, etc., as well as coupling effects between events of close proximity. As a first approximation, we ignore coupling effects and create a linear model, i.e., we assume that the power consumption function of the chip is simply the sum of the power consumption functions of all the events that take place.
Consider a particular cycle of a particular instruction in the execution path of some fixed code. At the start of the cycle, the chip is in one of several relevant states determined by the value of the relevant state bits, depending on the input and the processing done in earlier cycles. Let $S$ denote the set of possible relevant states when control reaches this cycle and let $E$ be the set of all possible events that can occur in a cycle. For each $s \in S$ and each $e \in E$, let $\mathrm{occurs}(e, s)$ be the binary function which is 1 if $e$ occurs when the relevant state is $s$ and 0 otherwise. Let $\mathrm{delay}(e, s)$ be the time delay of the occurrence of event $e$ in state $s$ from the clock edge, and let $f(e, t)$ denote the power consumption impulse function of event $e$ with respect to time $t$ ($t = 0$ when $e$ occurs, and $f(e, t) = 0$ for $t < 0$). In our linear model, $P(s, t)$, the power consumption function of the chip in that cycle with state $s$ and time $t$ after the clock edge, can be written as
$$P(s, t) = \sum_{e \in E} \mathrm{occurs}(e, s) \, f(e, t - \mathrm{delay}(e, s)) \qquad (1)$$
Equation 1 shows the strong dependence between the power consumption function and the relevant state, which is the basis of all statistical power attacks. Let $P_1$ and $P_2$ be two different probability distributions on the relevant state before the clock edge of a certain cycle. From Equation 1, it is very likely that the distribution of the instantaneous power when the state is drawn from $P_1$ will be different from the distribution of the instantaneous power when the state is drawn from $P_2$. This difference, and the distinguishability of different distributions by statistical tests on power samples, is the basis for Differential Power Attacks (DPA). Simple distributions are sufficient to mount these attacks. For example, in the DPA attacks described in [2], $P_1$ and $P_2$ are very simple: $P_1$ is the uniform distribution on the set of all relevant states which have a particular relevant state bit 1, and $P_2$ is the uniform distribution on the set of all relevant states which have the same bit 0.
The difference in the power distribution for these two cases represents the effect of that particular relevant state bit on the net power consumption. This can be used to extract cryptographic keys by guessing parts of keys, using this to predict a relevant state bit, and defining the distributions as described above. Higher-order differential attacks are those in which the distributions $P_1$ and $P_2$ are defined over multiple internal state variables and where the adversary has access to multiple side channels. Appendix 1 shows an example of distributions induced in the power consumption signal by distributions on the relevant state, for an actual chip card.
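As a rough illustration of the difference-of-means idea, the following sketch simulates a much simplified instance of the linear model: each relevant state bit that is set contributes a fixed amount of power, and Gaussian noise is added on top. The function names, the per-bit weights and the noise level are assumptions made for this example only and are not taken from any measured device.

```python
import random
import statistics

def power_sample(state_bits, weights, noise_sigma, rng):
    """Toy linear model: every set bit adds its weight, plus Gaussian noise."""
    signal = sum(w for bit, w in zip(state_bits, weights) if bit)
    return signal + rng.gauss(0.0, noise_sigma)

def difference_of_means(num_runs, target_bit, weights, noise_sigma, seed=0):
    """Estimate the power contribution of one state bit from many runs."""
    rng = random.Random(seed)
    ones, zeros = [], []
    for _ in range(num_runs):
        # The relevant state is drawn uniformly at random in every run.
        state = [rng.getrandbits(1) for _ in weights]
        sample = power_sample(state, weights, noise_sigma, rng)
        # Partition the samples by the value of the targeted bit
        # (the two partitions play the role of P_1 and P_2).
        (ones if state[target_bit] else zeros).append(sample)
    return statistics.mean(ones) - statistics.mean(zeros)

# Hypothetical parameters: 8 state bits of equal weight, noisy measurement.
WEIGHTS = [1.0] * 8
estimate = difference_of_means(20000, target_bit=3, weights=WEIGHTS, noise_sigma=2.0)
print(f"estimated contribution of bit 3: {estimate:.3f}")
```

With enough runs the estimate approaches the weight of the targeted bit, even though a single sample is dominated by the other bits and the noise.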
In defining security against statistical attacks we use the strongest possible notion: using the side channel, with high probability, the adversary should not be able to predict with even a slight advantage any bit that he could not predict from just the knowledge of inputs, outputs and program code. In using the side channel, the adversary is limited to trying to distinguish distributions which he can affect by choice of inputs and selection based on outputs. These are limited to distributions on bits such as bits in the algorithm specification, e.g., bits of the key, bits which depend directly on the key and the input, etc. Also, in an attack against the implementation the adversary could affect deterministic temporary variables, registers, etc. We informally define the set of realizable distributions which the adversary can directly affect as follows:
Definition 1. A distribution on state bits is realizable if the adversary can induce the distribution by suitable choice of inputs and selection based on outputs. In particular, this excludes distributions of state bits which result from explicit randomization introduced in the implementation outside of the specification.
The ability to distinguish any two realizable distributions is potentially advantageous to the adversary. Using standard notation (see for example [7]) we define the distinguishing probability of an adversary as follows:
Definition 2. Let $M$ be a binary-valued adversary who adaptively chooses $k$ inputs and has access to the side channel signals for the corresponding operations. Let $B_1$ and $B_2$ be any two realizable distributions on the bits of a computation, and $D_1$ and $D_2$ the distributions induced on the side channel signals by the choice of inputs and $B_1$ and $B_2$ respectively. Let $M^D$ denote $M$'s output when given $k$ input/output pairs and corresponding side channel samples from a distribution $D$. The distinguishing probability of $M$ when given samples from distributions $D_1$ and $D_2$ is $|\Pr[M^{D_1} = 1] - \Pr[M^{D_2} = 1]|$. $M$ is said to distinguish $B_1$ from $B_2$ using $k$ side-channel samples if the distinguishing probability of $M$ on $D_1$ and $D_2$ is at least some constant $c$.
Using this definition of adversaries, we define a secure computation. We intend to capture the extra information that the adversary obtains from the side channel.
Definition 3. A computation is said to be secure against $N$-sample side channel cryptanalysis if, for all adversaries $M$ and all realizable distributions $B_1$ and $B_2$, if $M$ can distinguish $B_1$ from $B_2$ using fewer than $N$ samples, then $M$ can distinguish $B_1$ and $B_2$ without the side channel.
The attacks described in [2] can be restated as using the side channel to distinguish distributions $B_1$ and $B_2$ which correspond to almost uniform distributions on a few relevant state bits, with a particular state bit depending on the input and key being 0 and 1 respectively. There the adversary bases its decision on comparing the mean of the given samples with some known threshold.
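The distinguishing probability of such a mean-threshold adversary can be estimated empirically. The sketch below assumes, purely for illustration, that the two induced side channel distributions are Gaussians with the same standard deviation whose means differ by a small shift; the function names and all parameter values are hypothetical.

```python
import random

def mean_threshold_adversary(samples, threshold):
    """Binary-valued adversary M: output 1 if the sample mean exceeds the threshold."""
    return 1 if sum(samples) / len(samples) > threshold else 0

def distinguishing_probability(k, shift, sigma, trials=1000, seed=1):
    """Estimate |Pr[M^{D1} = 1] - Pr[M^{D2} = 1]| for k samples per experiment."""
    rng = random.Random(seed)
    threshold = shift / 2.0  # midpoint between the two assumed means
    ones_d1 = ones_d2 = 0
    for _ in range(trials):
        d1 = [rng.gauss(0.0, sigma) for _ in range(k)]    # assumed D1 = N(0, sigma^2)
        d2 = [rng.gauss(shift, sigma) for _ in range(k)]  # assumed D2 = N(shift, sigma^2)
        ones_d1 += mean_threshold_adversary(d1, threshold)
        ones_d2 += mean_threshold_adversary(d2, threshold)
    return abs(ones_d1 - ones_d2) / trials

few = distinguishing_probability(k=4, shift=1.0, sigma=4.0)
many = distinguishing_probability(k=400, shift=1.0, sigma=4.0)
print(f"k=4: {few:.2f}, k=400: {many:.2f}")
```

With only a few samples the adversary's advantage stays small, while a few hundred samples make the two distributions almost perfectly distinguishable under these assumed parameters.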
Countermeasures to Power Analysis
Using these formal definitions of side channel cryptanalysis, we discuss general countermeasures against such attacks. First, we examine several ad hoc approaches to fixing this problem which, we believe, miss the import of these attacks and can easily be rendered ineffective. We then present a probabilistic encoding scheme with which we can effectively perform secure computations. Based on realistic approximations of the power models of Section 2, we prove lower bounds on the number of samples required to distinguish distributions.
Ad hoc Approaches
Due to the commercial impact, several ad hoc solutions are currently being implemented and claim to be resistant to these statistical attacks. Unfortunately, most can be defeated by signal processing in conjunction with only moderately more samples. Allowing for about 1 million possible experiments, it is reasonable to assume that the adversary can exploit every relevant state bit in any instruction to mount a statistical attack, provided he can efficiently predict that bit in a significant fraction of the runs based on the code specification, known inputs and a small number of guesses for parts of the key.
Some approaches to protecting computation use simple countermeasures such as "balancing", i.e., they try to negate the effects of one set of events by another "complementary" set. For example, by ensuring that all bytes used in computation have Hamming weight 4, one can try to negate the effect of each 1 bit by a corresponding 0 bit. Such approaches fail at high resolution and with a large number of samples, because the power consumption functions and timing of two "complementary" events will be slightly different, and the adversary can maximize these differences by adjusting the operating conditions of the card. Another popular approach is to randomize the execution sequence, i.e., keep the operations the same but permute their order; e.g., in DES, the S-boxes are looked up in a random order. Unless this random sequencing is done extensively throughout the computation, which may be impossible since the specification forces a causal ordering, it can be undone and a canonical order re-created by signal processing. Attacks can then be mounted on the reordered signals. Even if the entire computation cannot be canonically reordered, it is sufficient to identify "corresponding" sample points in different runs so that a significant fraction are samples from the same power function $P$ for the same cycle. All statistical attacks that work for $P$ are also applicable to "corresponding" points, although more samples would be needed due to the "noise" introduced by unrelated samples. In the case of permuted S-boxes, if the permutation is random, in 1/8 of the runs S-box 1 is looked up first, and in the remaining samples the signal at this point, corresponding to different lookups, is essentially random. Thus, even with no reordering, we now have a signal which is attenuated by a factor of 8. Mounting the original attack with 64 times the number of samples yields the same results. Elementary reordering substantially reduces this factor.
A similar countermeasure in hardware is typically achieved by making instructions take a variable number of cycles or by having the cycles be of varying length (see [4]). Once again, it is very easy to negate all these countermeasures with signal processing.
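The factor of 64 follows from a short signal-to-noise argument. The sketch below assumes a difference-of-means test with per-sample noise standard deviation $\sigma$, mean difference $\epsilon$ and a success constant $c$ (all three symbols are introduced here for illustration only), and it ignores any extra variance contributed by the unrelated lookups.

```latex
% Assumption: the test succeeds once the mean difference exceeds
% c standard errors of the sample mean over N runs.
\[
  \epsilon \;\ge\; c\,\frac{\sigma}{\sqrt{N}}
  \quad\Longleftrightarrow\quad
  N \;\ge\; \frac{c^{2}\sigma^{2}}{\epsilon^{2}} .
\]
% With a random S-box order only 1/8 of the runs carry the signal,
% so the observable mean difference drops to \epsilon / 8.
\[
  N' \;\ge\; \frac{c^{2}\sigma^{2}}{(\epsilon/8)^{2}}
     \;=\; 64 \cdot \frac{c^{2}\sigma^{2}}{\epsilon^{2}} .
\]
```

Under these assumptions, attenuating the signal by a factor of 8 raises the required number of samples by $8^2 = 64$, which matches the figure quoted in the text.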
A general countermeasure
A general countermeasure is to ensure that the adversary cannot predict any relevant bit in any cycle without making run-specific assumptions independent of the actual inputs to the computation. This makes statistical tests involving several experiments impossible, since the chance of the adversary making the correct assumptions for each run is extremely low. While this yields secure computation, it is not clear how one can do effective computation under this requirement, since no bit depending directly on the data and key can be manipulated at any cycle. In some cases the function being computed has algebraic properties that permit such an approach, e.g., for RSA one could use blinding [1, 3] to partially hide the actual values being manipulated. Another class of problems where this is possible is the class of random self-reducible problems [9]. Such structure is unlikely to be present in primitives such as block ciphers.
Encoding
The encoding we propose is to randomly split every bit of the original computation into $k$ shares, where each share is equiprobably distributed and every proper subset of $k - 1$ shares is statistically independent of the encoded bit.
Computation can then be carried out securely by operating only on the shares, without ever reconstructing the original bit. Shares are refreshed after every operation involving them to prevent information leakage to the adversary.
To fix a concrete encoding scheme, we assume that each bit is split into $k$ shares using any scheme which has the required stochastic properties. For instance, bit $b$ can be encoded as the $k$ shares $b \oplus r_1, r_2, \ldots, r_{k-1}, r_1 \oplus \cdots \oplus r_{k-1}$, where the $r_i$'s are randomly chosen bits. Furthermore, assume that each share is placed in a separate word at a particular bit position and all other bits of the share word are chosen uniformly at random.
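A minimal sketch of this XOR encoding is given below. The function names are chosen for this example, Python's `secrets` module stands in for the device's source of randomness, and `combine` exists only to check correctness: a protected implementation would never reconstruct the bit in this way. The `refresh` helper shows one simple way (an assumption, not a prescribed method) to re-randomize shares without changing the encoded bit.

```python
import secrets
from functools import reduce
from operator import xor

def split_bit(b, k):
    """Encode bit b as the k shares b^r1, r2, ..., r_{k-1}, r1^...^r_{k-1}."""
    if k < 2:
        raise ValueError("need at least two shares")
    r = [secrets.randbits(1) for _ in range(k - 1)]
    return [b ^ r[0]] + r[1:] + [reduce(xor, r)]

def combine(shares):
    """XOR of all shares; used here for testing only."""
    return reduce(xor, shares)

def xor_shared(x_shares, y_shares):
    """XOR two encoded bits share by share, never touching the plain bits."""
    return [x ^ y for x, y in zip(x_shares, y_shares)]

def refresh(shares):
    """Re-randomize the shares; each fresh bit is XORed into two shares."""
    fresh = list(shares)
    for i in range(len(fresh) - 1):
        r = secrets.randbits(1)
        fresh[i] ^= r
        fresh[-1] ^= r
    return fresh

shares = split_bit(1, k=4)
print(shares, "->", combine(shares))
```

Because XOR is linear, operations such as `xor_shared` act on each share independently, which is why this encoding suits computations built from XORs and bit permutations.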
In practice, it would be more useful if each word of computation is split similarly into $k$ shares. In that case, other schemes of splitting into shares, based on addition mod $2^8$ or subtraction mod $2^8$, would also be viable. Encoding bytes of data manipulated by splitting them into shares would yield the optimal performance. Ignoring the initial setup time, the performance penalty in performing computation using just the $k$ shares is a factor of $k$. Our results, which have been proved based on the bit encoding scheme, would also work for this case, but the bounds they yield are based only on the characteristics of the noise within the chip, and hence may not be optimal. This is discussed briefly after the analysis for the bit encoding case. The results and analysis we present here can serve as a framework in which to prove results for the byte encoding scheme.
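For byte-wide data, an additive splitting modulo $2^8$ can be sketched in the same spirit. The function names are hypothetical, `secrets` again stands in for the device's randomness source, and `combine_additive` is only a test helper.

```python
import secrets

MODULUS = 1 << 8  # arithmetic on bytes, i.e. modulo 2^8

def split_byte_additive(value, k):
    """Split a byte into k shares whose sum modulo 2^8 equals the byte."""
    if k < 2:
        raise ValueError("need at least two shares")
    randoms = [secrets.randbelow(MODULUS) for _ in range(k - 1)]
    last = (value - sum(randoms)) % MODULUS
    return randoms + [last]

def combine_additive(shares):
    """Sum of the shares modulo 2^8; used here for testing only."""
    return sum(shares) % MODULUS

def add_shared(x_shares, y_shares):
    """Add two encoded bytes share by share."""
    return [(x + y) % MODULUS for x, y in zip(x_shares, y_shares)]

a = split_byte_additive(200, k=3)
b = split_byte_additive(100, k=3)
print(combine_additive(add_shared(a, b)))  # (200 + 100) mod 256
```

Modular addition then plays the role that XOR plays in the bit encoding: it can be carried out on each share separately.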
The method to encode the bit in secret shares should be chosen based on the computation being protected. For instance, for an implementation of DES, the XOR scheme is ideal since the basic operations used are XOR, permutations, and table lookups. Table lookups can be handled by first generating a random rearrangement of the original table, since a randomized index will be used to look up the table. This step increases the overhead beyond the factor of 2.
In practice, the splitting technique needs to be applied only for a sufficient number of steps into the computation until the adversary has very low probability of predicting bits, i.e., until sufficient secret-key-dependent operations have been carried out. Similar splitting also has to be done at the end of the computation if the adversary can get access to its output. For instance, in DES, one needs to use the splitting scheme only for the first four and last four rounds.
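One common way to realize such a rearranged table for two XOR shares is sketched below. This is an illustration under stated assumptions rather than a construction spelled out in the text: `TOY_TABLE` is a made-up 4-bit table (not a DES S-box), the input is assumed to be held as the pair (masked index, input mask), and all names are hypothetical.

```python
import secrets

# Hypothetical 4-bit lookup table, used only for illustration.
TOY_TABLE = [(7 * i + 3) % 16 for i in range(16)]

def masked_table(table, mask_in, mask_out):
    """Rearranged table T' with T'[i] = T[i ^ mask_in] ^ mask_out."""
    return [table[i ^ mask_in] ^ mask_out for i in range(len(table))]

def shared_lookup(table, x_shares):
    """Look up T[x] given x as (x ^ mask_in, mask_in), returning two XOR shares.

    The table length is assumed to be a power of two, so that
    i ^ mask_in is always a valid index.
    """
    masked_index, mask_in = x_shares
    mask_out = secrets.randbelow(len(table))
    rearranged = masked_table(table, mask_in, mask_out)
    # rearranged[x ^ mask_in] == T[x] ^ mask_out, so the unmasked x is never used.
    return rearranged[masked_index], mask_out

m = secrets.randbelow(16)
y0, y1 = shared_lookup(TOY_TABLE, (9 ^ m, m))
print(y0 ^ y1 == TOY_TABLE[9])
```

Building a fresh rearranged table for every lookup costs one pass over the whole table, which is consistent with the remark that this step pushes the overhead beyond a factor of 2.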
Analysis
We analyze the encoding scheme described above by making reasonable assumptions on the distribution of side channel information, and prove that the amount of side channel information required grows exponentially in $k$, the number of shares. For concreteness we fix the XOR bit encoding scheme and consider the instantaneous power consumption at some time instant in a cycle manipulating a share. The relevant state in that cycle will not only include a share of the bit, but also all the other random bits in the word. It is quite reasonable to assume that the contributions of all bits in the word will be similar in magnitude. From Equation 1, expanding $\mathrm{occurs}(e, s)$ as a linear form over the bits of $s$, the instantaneous power consumption when a particular share is being manipulated will be $P = b \cdot s_0 + \mathcal{P} \cdot s_0 + R$, where $b$ is the contribution of just the shared bit $s_0$, $\mathcal{P} \cdot s_0$ is the distribution of power contributions of events which require $s_0$ and other state bits, and finally $R$ is the distribution of events which are independent of the bit $s_0$. In operations such as load, store and XOR, if $s_0$ is a bit in a word being manipulated, the factor $\mathcal{P}$ can be viewed as a small perturbation on the real value $b$. In simple operations there is no "interaction" between the different bits of the value being manipulated and, as an approximation, we will ignore the contribution of the variable $\mathcal{P}$. The random variable $R$ is typically much larger than $b$ since it includes the sum of similar contributions from all other bits. For most operations, $R$ is the sum of almost independent distributions, which is very well approximated by the Normal distribution. Thus, we make the realistic assumption, which has been empirically tested as shown in Appendix 1, that $R$ has a normal distribution with mean $\mu$ and variance $\sigma^2$. The results we prove can also be shown to hold in the case that $R$ is the sum of i.i.d. random variables, which is the case for operations such as load, store and XOR.
Further work needs to be done to analyze more complex and precise distributions which model all chip card operations, such as multiply, where there is interaction between the bits being manipulated and it is unlikely that one can ignore the contribution of the variable $\mathcal{P}$.
Assume that in each sample the adversary has access to the $k$ signal values corresponding to the power consumption at instructions which access the shares $b \oplus r_1, r_2, \ldots, r_{k-1}, r_1 \oplus r_2 \oplus \cdots \oplus r_{k-1}$. Rewrite these bits as $r_1, \ldots, r_k$, with $r_1 \oplus \cdots \oplus r_k = b$. Denote the distribution of the instantaneous power consumption signal at these points by random variables $Z_1, Z_2, \ldots, Z_k$. Also, let $Z_i = A_i + X_i$, where $A_i$ is the contribution due to the bit in concern and $X_i$ is the additive factor which follows the distribution $R$. By the definition of the encoding, $A_i$ takes values 0 and 1 with probability $1/2$ each. Any noise in the contribution due to $A_i$ can be absorbed in $R$ without affecting the distribution of $R$, since $R$ is typically much bigger than $b$. Thus, the power contribution due to $A_i$ is 1 if $r_i = 1$ and 0 if $r_i = 0$. It is important to note that the $A_i$'s are not independent since $A_1 \oplus \cdots \oplus A_k = b$.
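The effect of this dependence can be seen in a small simulation for $k = 2$. The sketch below assumes zero-mean Gaussian noise for the $X_i$ and uses the centered product of the two signals as a joint statistic; that statistic, the function names and the parameter values are choices made for this illustration and are not the argument used in the proof.

```python
import random
import statistics

def sample_pair(b, sigma, rng):
    """One observation (Z1, Z2) with Z_i = A_i + X_i and A_1 xor A_2 = b."""
    a1 = rng.getrandbits(1)
    a2 = a1 ^ b
    # X_i is assumed to be zero-mean Gaussian noise for this sketch.
    return a1 + rng.gauss(0.0, sigma), a2 + rng.gauss(0.0, sigma)

def first_and_second_order(b, sigma, runs, seed):
    """Return the mean of Z1 and the mean of (Z1 - 1/2)(Z2 - 1/2)."""
    rng = random.Random(seed)
    pairs = [sample_pair(b, sigma, rng) for _ in range(runs)]
    mean_z1 = statistics.mean(z1 for z1, _ in pairs)
    # 1/2 is the mean of each Z_i, since A_i is a fair bit and the noise has mean 0.
    centered_product = statistics.mean((z1 - 0.5) * (z2 - 0.5) for z1, z2 in pairs)
    return mean_z1, centered_product

mean0, prod0 = first_and_second_order(b=0, sigma=1.0, runs=20000, seed=2)
mean1, prod1 = first_and_second_order(b=1, sigma=1.0, runs=20000, seed=3)
print(f"b=0: mean(Z1)={mean0:.3f}, joint statistic={prod0:.3f}")
print(f"b=1: mean(Z1)={mean1:.3f}, joint statistic={prod1:.3f}")
```

The mean of a single share signal is the same for both values of $b$, so a first-order test learns nothing; only the joint statistic separates the two cases, and its variance grows quickly with the noise level, which is the intuition behind needing many more samples as shares are added.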
In defining distributions that an adversary can try to distinguish using inputs, outputs and the side channel information, note that the adversary cannot control [...] Note that this is not a tight lower bound and we conjecture that $n^k$ is the tight bound. We sketch the proof for the case $k = 2$; the general proof can be done along the same lines. We require the following basic facts from probability theory.
Probability Theory Basics
The density function of the Normal distribution with mean $\mu$ and variance $\sigma^2$ is $\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$. [...], where [...] is a constant. We show that no adversary can distinguish between sequences with at most $m$ samples, sampled according to $D_1$ and $D_2$. In the following exposition, $T$ is a random variable denoting a randomly drawn sequence, and $s$ denotes a possible value of $T$. The outline of the proof is as follows. We first define a set of bad sequences (Definition 6 below). Then we show in Lemma 8 that under distributions $D_1$ and $D_2$ the probability that a sampled sequence $T$ is bad, i.e., the probability of the event $\mathrm{BAD}(T)$, is very small. Restricting ourselves to sequences which are not bad, we show that the probability that the random variable $T$ has a particular value $s$ is almost the same whether we are sampling according to $D_1$ or $D_2$. In particular, in Lemma 10 we show that $\Pr_{D_1}[T = s \mid \neg\mathrm{BAD}(T)] \le \epsilon_n \Pr_{D_2}[T = s \mid \neg\mathrm{BAD}(T)]$, where $\epsilon_n$ is close to 1, from above. Similarly, we show that the probability of a sequence which is not bad, when sampled according to $D_2$, is at least $\epsilon_n^{-1}$ times its probability under $D_1$. In other words, the occurrence probability of a sequence that is not bad is almost the same under both distributions. Putting it all together, we then show that the adversary cannot distinguish the distributions using fewer than $m$ samples. We begin with the definition of bad sequences. In the above definition and in the rest of the proof we have ignored the cases when $u, v < 0$; these can be treated symmetrically. [...] The probability that the sequence is bad according to case 1 is at most $m$ times this small probability.
Thus the space of bad sequences is very small. We now argue that for sequences that are not bad, the probability of occurrence is almost the same under both distributions. Denote $\Pr_{D_1}[T_i = \langle -u, -v \rangle]$ by $X_{u,v}$. Also, let $\Pr_{D_2}[T_i = \langle -u, -v \rangle]$ be $X_{u,v} + \Delta_{u,v}$. The difference $\Delta_{u,v}$ is due to the contribution of the binary-valued random variable. The following lemma bounds $\Delta_{u,v}$. [...] The fifth inequality follows by the power series expansion of $e^x$, from which it can be shown that for $0 \le x \le 1$, $e^x \le 1 + 2x$. Since $\epsilon_n$ is close to one, and $\Pr_{D_1, D_2}[\neg\mathrm{BAD}(T)]$ is also close to one, the first summand on the right of the above inequality is close to zero. The second summand is also close to zero as $\Pr_{D_1, D_2}[\mathrm{BAD}(T)]$ is close to zero by Lemma 8. Thus, the distinguishing probability is close to zero.
Encoding Bytes
For practical computation, we would use the encoding scheme of splitting each relevant byte of the computation into $k$ shares. It is clear from our proof techniques that if there were enough additional noise in the power signals to effectively mask the byte values, then the same proofs would go through for the byte encoding scheme. This seems unlikely to happen in limited devices. It may be possible to extend our proof techniques to account for the fact that there is uncertainty about the value of a byte being manipulated given its power signal, even without any additional noise.
We have presented a simplified initial step into the formal analysis of computing in the presence of loss of entropy due to leaked side channel information. Our lower bounds on the amount of side channel information required are proved for reasonable approximations of the actual distributions. Substantial effort is required to find more effective and general countermeasures against such attacks. Besides proving implementations secure from power attacks, this framework could also be used to design ciphers and other primitives which readily admit a secure, efficient implementation.
Figure 1 shows three distinct distributions of the instantaneous power consumption of a commonly available chip in the middle of a cycle which loads the value of a RAM byte into the accumulator. These correspond to three different distributions on the value of that particular RAM byte. All three power distributions are plotted on a "normal scale" and each distribution shows up as a thick line in this plot, which means all three power distributions are close to normal. The middle line corresponds to the power distribution when the RAM byte is drawn uniformly at random. It has a mean of 0 (we have shifted all power readings by an additive constant to enforce this). The top line corresponds to the power distribution when the RAM byte is uniformly chosen from all bytes with MSB 1. It has a mean of $-25$. The bottom line corresponds to the power distribution when the RAM byte is uniformly chosen from all bytes with MSB 0. This has a mean of $+25$.
