Abstract. The core of the 3rd Generation Partnership Project (3GPP) encryption standard 128-EEA3 is a stream cipher called ZUC. It was designed by the Chinese Academy of Sciences and proposed for inclusion in the cellular wireless standards called "Long Term Evolution" or "4G". The LFSR-based cipher uses a 128-bit key. In this paper, we first show timing attacks on ZUC that can recover, with about 71.43% success rate, (i) one bit of the secret key immediately, and (ii) information involving 6 other key bits. The time, memory and data requirements of the attacks are negligible. While we see potential improvements to the attacks, we also suggest countermeasures.
Introduction
ZUC [8] is a stream cipher designed by the Data Assurance and Communication Security Research Center (DACAS) of the Chinese Academy of Sciences. The cipher forms the core of the 3GPP mobile standards 128-EEA3 (for encryption) and 128-EIA3 (for message integrity) [7]. It was proposed for inclusion in the Long Term Evolution (LTE) or the 4th generation of cellular wireless standards (4G) (see footnote 1). ZUC is LFSR-based and uses a 128-bit key and a 128-bit initialization vector (IV). Some key points in the evolution of ZUC are listed in the following timeline.
Footnote ⋆: A shorter and slightly older version of this paper is due to appear in the proceedings of Inscrypt 2012. It was written when the author was with the National University of Singapore.
Footnote 1: Strictly speaking, LTE is not 4G as it does not fully comply with the International Mobile Telecommunications Advanced (IMT-Advanced) requirements for 4G. Put differently, LTE is beyond 3G but pre-4G.
Timeline:
- 18th June 2010: The Security Algorithms Group of Experts (SAGE) of the European Telecommunications Standards Institute (ETSI) published a document providing the specifications of the first version of ZUC. The document was indexed "Version 1.0".
- 26th-30th July 2010: Improvements and minor corrections were made successively to the C implementation of the ZUC algorithm of Version 1.0. These resulted in versions 1.2 and 1.3 of the ETSI/SAGE document. The preface to Version 1.3 was corrected and the resulting document released as Version 1.4.
- 2nd-3rd December 2010 (First International Workshop on ZUC Algorithm): A few observations on the algorithm of Version 1.4 were reported (see [6]) but none of these posed any immediate threat to its security.
- 5th-9th December 2010 (ASIACRYPT): The algorithm of Version 1.4 was cryptanalysed by Wu et al. [24] and the results were presented at the rump session of ASIACRYPT 2010.
The attack reduces the effective key size of ZUC to about 66 bits by exploiting the fact that a difference set between a pair of IVs may result in identical keystreams.
- 8th December 2010: Gilbert et al. reported an existential forgery attack on the 128-EIA3 MAC algorithm.
The attack allows one, given any message and its MAC value under an unknown integrity key and an initialization vector, to predict the MAC value of a related message under the same key and the same initialization vector with a success probability of 1/2.
Gilbert et al. also gave a modified version of the 128-EIA3 algorithm (cf. [12, Algorithm 2]).
In the original 128-EIA3 construction, some 32-bit keystream words are used in computing the universal hash function, and then the next whole word of keystream is used as a mask. But in [12, Algorithm 2], the first keystream word is used as the mask. The latter algorithm better fits the standard Carter-Wegman construction [5].
- 4th January 2011: In response to Wu et al.'s key recovery attack, the initialization of ZUC was modified. Version 1.5 contains the new algorithm [8]. This algorithm is the one we analyse in this paper; we have been and shall henceforth be simply calling it "ZUC" (i.e., without any accompanying version numbers).
- 5th-6th June 2011 (The 2nd International Workshop on ZUC Algorithm and Related Topics): Gilbert et al. presented an updated version (cf. [12]) of their paper. In this they argue that [12, Algorithm 2] might have slightly greater resistance against nonce reuse.
- 7th June 2011 to 26th August 2011: Changing the ZUC integrity algorithm of 128-EIA3 to [12, Algorithm 2] was being considered by the ETSI/SAGE in June 2011. Although [12, Algorithm 2] offers some advantages, they appear to be marginal.
In this paper, we present two timing attacks on ZUC, each of which can (in the best case) recover, with a success probability of nearly 0.7143, (i) one bit of the key immediately, and (ii) information involving 6 other bits of the key. Before describing how this paper is organised, we shall discuss timing attacks briefly.
Timing attack: This is a side-channel attack in which the attacker exploits timing measurements of (parts of) the cryptographic algorithm's implementation. For example, in the case of unprotected AES implementations based on lookup tables, the dependence of the lookup time on the table index can be exploited to speed up key recovery [4] . A cache timing attack is a type of timing attack which is based on the idea that the adversary can observe the cache accesses of a legitimate party. The cache is an intermediate memory between the CPU and the RAM and is used to store frequently used data fetched from the RAM. The problem with the cache memory is that, unlike the RAM, it is shared among users sharing a CPU.
Hence, if Bob and Eve are sharing a CPU and Eve is aware that Bob is about to encrypt, Eve may initiate her cache timing attack as follows. She first fills the cache memory with values of her choice and waits for Bob to run the encryption algorithm. She then measures the time taken to load the earlier cache elements into the CPU; loading is quick if the element is still in cache (such an event is called a cache hit; its complement is a cache miss) and not overwritten by one of Bob's values. This technique is known as Prime+Probe [18]. Cache timing attacks have been successfully mounted on several ciphers, notably the AES [4, 18, 26, 14].
In [18], two types of cache timing attacks are introduced: synchronous and asynchronous. In a synchronous attack, the adversary can make cache measurements only after certain operations of the cipher (e.g., a full update of a stream cipher's internal state) have been performed. In this attack scenario, the plaintext or the ciphertext is assumed to be available to the adversary. An asynchronous cache adversary, on the other hand, is able to make cache measurements in parallel to the execution of the routine. She is able to obtain a list of all cache accesses made in chronological order [26]. Here, there are different viewpoints on the resources available to the adversary. According to Osvik et al., the adversary has only the distribution of the plaintext/ciphertext and not sample values [18]. Zenner differs in [26], where he argues that the adversary can (partially) control input/output data and observe cache behaviour. Asynchronous attacks are particularly effective on processors with simultaneous multithreading. One of the timing attacks in this paper is an asynchronous cache timing attack, and the other is a straightforward timing attack that does not involve the cache.
Organisation: Section 2 provides the specifications of ZUC along with some notation and convention. The preliminary observations that lead us to timing attacks are listed in Sect. 3 and the attacks are detailed in Sect. 4. We follow this with an analysis of some design/implementation modifications that resist the attacks, in Sect. 5. In Sect. 6, we see possible improvements to the timing attacks and find that the proposed design modifications resist these improved attacks too. In addition, we see several highlights of our attacks such as the novelty of an employed technique. The paper concludes with a suggestion for future work, in the same section.
Specifications of ZUC
In this paper, we follow much of the notation and many of the conventions of [8], in addition to those given in Table 1. As previously mentioned, the inputs to the ZUC cipher are a 128-bit key and a 128-bit IV. The algorithm has three parts or "layers", namely a linear feedback shift register (LFSR) layer, a bit-reorganisation ("BR") layer and a nonlinear function F, which are shown in Figure 1. The execution of the algorithm proceeds in two stages: an initialization stage and a "working" stage. Each iteration of the algorithm in the working stage generates 32 bits of keystream output. We shall now detail the layers and stages to the level that is required for the understanding of the results to follow. For the complete specifications, the interested reader is referred to [8, Sect. 3].
The LFSR layer: ZUC uses one LFSR of sixteen 32-bit cells that hold the 31-bit values $s_0, s_1, \ldots, s_{15}$. However, none of the 31-bit elements can assume the value 0; the remaining $2^{31} - 1$ values are allowed. The steps of the LFSR layer in the initialization mode comprise Algorithm 1. The steps of the LFSR layer in the working mode comprise Algorithm 2.
The BR layer: In this layer, 128 bits are extracted from the cells of the LFSR and four 32-bit words are formed. Three of these words ($X_0$, $X_1$, $X_2$) are used by the nonlinear function F, and the fourth word ($X_3$) is used in producing the keystream.
The nonlinear function F: This function involves two 32-bit values in memory cells ($R_1$, $R_2$), one 32 × 32 S-box ($S$), two linear transforms ($L_1$, $L_2$) and the aforementioned three 32-bit words produced by the BR layer. The output of the function F is a 32-bit word $W$. The 32-bit keystream word $Z$ that is produced in every iteration of the working mode of the ZUC algorithm is simply $W \oplus X_3$. The F function is defined as follows:
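Algorithm 2. The LFSR layer in the working mode
1: $s_{16} = 2^{15} s_{15} + 2^{17} s_{13} + 2^{21} s_{10} + 2^{20} s_4 + (1 + 2^8) s_0 \bmod (2^{31} - 1)$;
2: if $s_{16} = 0$ then
3: $s_{16} \leftarrow 2^{31} - 1$;
4: $(s_1, s_2, \ldots, s_{15}, s_{16}) \rightarrow (s_0, s_1, \ldots, s_{14}, s_{15})$;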
Key loading: The key loading procedure expands the 128-bit secret key and the 128-bit IV to form the initial state of the LFSR. In [8], this key is denoted as $k$ ($= k_0 \| k_1 \| \ldots \| k_{15}$, where each $k_i$ is a byte) and the IV as $iv$ ($= iv_0 \| iv_1 \| \ldots \| iv_{15}$, where each $iv_i$ is a byte). In addition to $k$ and $iv$, a 240-bit constant $D$ ($= d_0 \| d_1 \| \ldots \| d_{15}$) is used in the key loading procedure. We shall now provide the binary representations of the $d_i$'s first (in Table 2), followed by the key loading procedure.
Table 2. The constants $d_i$, $i \in \{0, 1, \ldots, 15\}$, used in the key loading procedure
d0   100010011010111    d8   100110101111000
d1   010011010111100    d9   010111100010011
d2   110001001101011    d10  110101111000100
d3   001001101011110    d11  001101011110001
d4   101011110001001    d12  101111000100110
d5   011010111100010    d13  011110001001101
d6   111000100110101    d14  111100010011010
d7   000100110101111    d15  100011110101100
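Given this, the key loading is a single, straightforward step: each 31-bit LFSR cell is the concatenation of one key byte, one 15-bit constant and one IV byte,
$$s_i = k_i \,\|\, d_i \,\|\, iv_i, \quad i \in \{0, 1, \ldots, 15\}. \qquad (1)$$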
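The packing of one key byte, one constant and one IV byte into each LFSR cell can be sketched in C as follows. This is a minimal illustration and not the reference code of [8]: the names zuc_load_key and ZUC_D are ours, the hexadecimal constants are the entries of Table 2 rewritten in base 16, and the layout assumed is $k_i$ in bits 30..23, $d_i$ in bits 22..8 and $iv_i$ in bits 7..0.

```c
#include <stdint.h>

/* The 15-bit constants d_0..d_15 of Table 2, written in hexadecimal. */
static const uint16_t ZUC_D[16] = {
    0x44D7, 0x26BC, 0x626B, 0x135E, 0x5789, 0x35E2, 0x7135, 0x09AF,
    0x4D78, 0x2F13, 0x6BC4, 0x1AF1, 0x5E26, 0x3C4D, 0x789A, 0x47AC
};

/* Hypothetical helper: fills s[0..15] with the 31-bit words k_i || d_i || iv_i. */
void zuc_load_key(const uint8_t k[16], const uint8_t iv[16], uint32_t s[16])
{
    for (int i = 0; i < 16; i++) {
        s[i] = ((uint32_t)k[i] << 23)      /* key byte: bits 30..23 */
             | ((uint32_t)ZUC_D[i] << 8)   /* constant: bits 22..8  */
             | (uint32_t)iv[i];            /* IV byte:  bits 7..0   */
    }
}
```

Under this layout the most significant bit of every cell is a key bit, which is the property the timing analysis relies on.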
The execution of ZUC: As mentioned earlier, the execution of the ZUC algorithm proceeds in two stages. We shall now describe these stages.
The initialization stage: This stage is given by Algorithm 3.
The working stage: This stage, in turn, has two sub-stages that are given by Algorithms 4 and 5.
Motivational Observations
Algorithm 5. Keystream generation in the working stage
1: repeat
2: Execute the BR layer;
3: Compute the nonlinear function F taking as inputs the outputs $X_0$, $X_1$ and $X_2$ of the BR layer;
4: Compute the keystream as $Z = W \oplus X_3$;
5: Run Algorithm 2;
6: until one 32-bit keystream word more than the required number of words is generated

We start with the following two trivial observations.
Observation 1
The ZUC key is initially loaded directly into the 16 LFSR cells.
Observation 2 Multiplication and addition in the initialization mode and working mode of the LFSR layer are modulo $(2^{31} - 1)$. Other additions and multiplications are modulo $2^{32}$.
Addition modulo $(2^{31} - 1)$ of two 31-bit integers $x$ and $y$ is performed in [8] as follows. First, they are stored in 32-bit cells and $z = x + y \bmod 2^{32}$ is computed. If the end carry, meaning the carry-in at the MSB position of a 32-bit word/register/memory cell, is $b$, the MSB of the 32-bit $z$ is first discarded and then this 31-bit word is incremented by $b$. This is implemented in C in [8] as the subroutine Add(), in which the intermediate sum is held in a variable u32 z.
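The listing from [8] is not reproduced here. A minimal sketch that matches the description above could look as follows; it is our reconstruction from that description (reusing the names Add() and u32 z that appear in the text), not a verbatim copy of the reference code.

```c
#include <stdint.h>

typedef uint32_t u32;

/* Sketch: addition modulo 2^31 - 1 of two 31-bit values x, y in {1, ..., 2^31 - 1}. */
u32 Add(u32 x, u32 y)
{
    u32 z = x + y;                  /* 32-bit sum; bit 31 holds the end carry */
    if (z & 0x80000000u) {          /* taken only when the end carry is 1 */
        z = (z & 0x7FFFFFFFu) + 1;  /* discard the MSB, then add the carry back */
    }
    return z;
}
```

The point of interest is the conditional: the extra AND and increment are executed only when the end carry is 1, so the running time of such a routine depends on its operands.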
It is to be noted that the increment step in Add() cannot regenerate an end carry because $x, y \in \{1, 2, \ldots, 2^{31} - 1\}$ implies that u32 z has at least one zero in its 31 LSBs.
An end carry of 1 brings in one extra 32-bit AND operation and one 32-bit addition in the software implementation (in a hardware implementation, we have 32 bitwise AND operations and one 32-bit ripple carry addition). Let $T_{carry}$ denote the total time taken by the processor to perform these additional operations and $T$ denote the time taken to run the Add() subroutine without the step where $z$ is incremented. We now have the following simple observation that forms the basis of our timing analysis.
Observation 3
If the attacker observes that the time taken to run the Add() subroutine is $T + T_{carry}$, then she necessarily concludes that the end carry is 1, and can use this to retrieve some information on the summands $x$ and $y$ in general and their MSBs in particular.
In Sect. 4, we shall show how we exploit Observations 1-3 to mount (partial) key recovery attacks on ZUC.
The Timing Attacks
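In this section, we shall examine the first invocation of the LFSR layer in the initialization mode. Recall that the first step of Algorithm 1 is:
$$v = 2^{15} s_{15} + 2^{17} s_{13} + 2^{21} s_{10} + 2^{20} s_4 + (1 + 2^8) s_0 \bmod (2^{31} - 1). \qquad (2)$$
Given a 32-bit cell containing a 31-bit integer $\delta$, the product $2^n \odot \delta$ is implemented in C in [8] as ((δ ≪ n)|(δ ≫ (31 − n))) & 0x7FFFFFFF. Given this and the manner in which the key bits are loaded into the cells initially (see [8, Sect. 3]), we see that the 31-bit summands on the RHS of (2) in the first round of the initialization mode are:
$z_1 = [k_{0(7)} \ldots k_{0(0)}\ d_{0(14)} \ldots d_{0(0)}\ iv_{0(7)} \ldots iv_{0(0)}]$,
$z_2 = [d_{0(14)} \ldots d_{0(0)}\ iv_{0(7)} \ldots iv_{0(0)}\ k_{0(7)} \ldots k_{0(0)}]$,
$z_3 = [d_{4(2)} \ldots d_{4(0)}\ iv_{4(7)} \ldots iv_{4(0)}\ k_{4(7)} \ldots k_{4(0)}\ d_{4(14)} \ldots d_{4(3)}]$,
$z_4 = [d_{10(1)}\ d_{10(0)}\ iv_{10(7)} \ldots iv_{10(0)}\ k_{10(7)} \ldots k_{10(0)}\ d_{10(14)} \ldots d_{10(2)}]$,
$z_5 = [d_{13(5)} \ldots d_{13(0)}\ iv_{13(7)} \ldots iv_{13(0)}\ k_{13(7)} \ldots k_{13(0)}\ d_{13(14)} \ldots d_{13(6)}]$,
$z_6 = [d_{15(7)} \ldots d_{15(0)}\ iv_{15(7)} \ldots iv_{15(0)}\ k_{15(7)} \ldots k_{15(0)}\ d_{15(14)} \ldots d_{15(8)}]$.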
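The rotation-based product and the resulting summands can be sketched in C as below. The function names mul_pow2 and first_step_summands are ours. We assume the feedback taps of the ZUC specification (cells 0, 4, 10, 13 and 15 with multipliers $1 + 2^8$, $2^{20}$, $2^{21}$, $2^{17}$ and $2^{15}$) and the summand order $z_1 = s_0$, $z_2 = 2^8 \odot s_0$, ..., $z_6 = 2^{15} \odot s_{15}$; the order in which the reference code adds them may differ.

```c
#include <stdint.h>

/* 2^n * delta mod (2^31 - 1), for a 31-bit delta and 0 < n < 31:
   a left rotation of delta by n positions within 31 bits. */
uint32_t mul_pow2(uint32_t delta, unsigned n)
{
    return ((delta << n) | (delta >> (31 - n))) & 0x7FFFFFFFu;
}

/* Hypothetical helper: the six summands of the first LFSR step,
   stored as z[0] = z_1, ..., z[5] = z_6. */
void first_step_summands(const uint32_t s[16], uint32_t z[6])
{
    z[0] = s[0];
    z[1] = mul_pow2(s[0], 8);
    z[2] = mul_pow2(s[4], 20);
    z[3] = mul_pow2(s[10], 21);
    z[4] = mul_pow2(s[13], 17);
    z[5] = mul_pow2(s[15], 15);
}
```

For instance, with $k_{15} = iv_{15} = 0$ the cell $s_{15}$ equals 0x0047AC00 and its rotation by 15 positions is 0x56000047, whose bit 30 is the constant $d_{15(7)} = 1$.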
In the C implementation of ZUC in [8], the $z_i$'s are added modulo $(2^{31} - 1)$ using the Add() subroutine. Recall that the $d_{i(j)}$'s are known (see Table 2). There is no vector $[z_{1(30)}\ z_{2(30)}\ \ldots\ z_{6(30)}]$ such that an end carry is not produced. This is because $d_{0(14)} = 1$ and $d_{15(7)} = 1$. Let $c_1$ denote the carry bit produced by the addition of $z_{1(29)}$, $z_{2(29)}$ and the carry coming in from bit position 28 (bit position 0 denotes the LSB), in the first step of the Add() subroutine. The sum bit in this addition is added with $z_{3(29)}$ and the corresponding carry coming in from bit position 28.
[...] T := $I_6$, where $I_6$ is the identity matrix of size 6.
Clarification: Among the MSBs of the 31-bit $z_i$'s, all but the MSB of $z_1$ are known to us. Let us, for example, suppose that this unknown bit is 1. Then, we are bound to have a carry-out (in other words, a carry-in at bit position 31, or 'end carry'). Since the $z_i$'s are added progressively modulo $(2^{31} - 1)$, an end carry can be produced several times ($\lambda$ times, say, in total). If the MSBs of the $z_i$'s are all variables, $\lambda$ is bounded from above by 5, the number of additions modulo $(2^{31} - 1)$ (see footnote 7). Because at least 15 bits of each of $z_1, z_2, \ldots, z_6$ are constants, the assumption of uniform distribution can no longer be made right away. If the IV is a known constant, one can assume that the 40 key bits $k_0 \| k_4 \| k_{10} \| k_{13} \| k_{15}$ are uniformly distributed at random and compute $Pr(\Gamma_i)$, for $i \in \{1, 2, \ldots, 7\}$, by running a simulation. Otherwise, the 40 IV bits $iv_0 \| iv_4 \| iv_{10} \| iv_{13} \| iv_{15}$ may also be assumed to be uniformly distributed at random, and the probabilities $Pr(\Gamma_i)$ estimated theoretically. However, the latter approach appears to be highly involved, so we instead performed Experiment 1.
Experiment 1
The key/IV bytes $k_0$ and $iv_0$ are exhaustively varied, setting every other key/IV byte to 0x00, and the cases where an end carry is produced exactly once, when $z_1, z_2, \ldots, z_6$ are added modulo $(2^{31} - 1)$, are examined.
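An enumeration of this kind can be sketched in C as follows. The sketch is ours and rests on stated assumptions: cells are packed as $k_i \| d_i \| iv_i$, the summands are $s_0$, $2^8 \odot s_0$, $2^{20} \odot s_4$, $2^{21} \odot s_{10}$, $2^{17} \odot s_{13}$ and $2^{15} \odot s_{15}$, and they are accumulated left to right. It only counts end carries; it does not classify the cases by the vectors $\Gamma_i$, and we make no claim that it reproduces the exact figures reported in this paper.

```c
#include <stdint.h>

/* 31-bit rotation, i.e. multiplication by 2^n modulo 2^31 - 1. */
static uint32_t rot31(uint32_t x, unsigned n)
{
    return ((x << n) | (x >> (31 - n))) & 0x7FFFFFFFu;
}

/* Adds y to *acc modulo 2^31 - 1 and returns the end carry (0 or 1). */
static unsigned add_count(uint32_t *acc, uint32_t y)
{
    uint32_t z = *acc + y;
    unsigned carry = z >> 31;
    *acc = (z & 0x7FFFFFFFu) + carry;
    return carry;
}

/* hist[l] receives the number of (k0, iv0) pairs, out of 65536,
   for which exactly l end carries occur while summing z_1..z_6. */
void experiment1_histogram(unsigned long hist[6])
{
    /* Cells s4, s10, s13, s15 with key and IV bytes 0x00: only d_i << 8 remains. */
    const uint32_t s4 = 0x5789u << 8, s10 = 0x6BC4u << 8,
                   s13 = 0x3C4Du << 8, s15 = 0x47ACu << 8;
    for (int l = 0; l < 6; l++) hist[l] = 0;
    for (uint32_t k0 = 0; k0 < 256; k0++) {
        for (uint32_t iv0 = 0; iv0 < 256; iv0++) {
            uint32_t s0 = (k0 << 23) | (0x44D7u << 8) | iv0;
            uint32_t z[6] = { s0, rot31(s0, 8), rot31(s4, 20),
                              rot31(s10, 21), rot31(s13, 17), rot31(s15, 15) };
            uint32_t acc = z[0];
            unsigned lambda = 0;
            for (int i = 1; i < 6; i++) lambda += add_count(&acc, z[i]);
            hist[lambda]++;
        }
    }
}
```

Since two of the summands have their bit 30 fixed to 1 by the constants, the entry hist[0] is always zero, in line with the remark that an end carry cannot be avoided.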
We found 6995 such cases (out of a total of 256 × 256 = 65536 cases). In 3444 of the cases, the vector was $\Gamma_6$; in 3030 cases, $\Gamma_5$; and in the remaining 521 cases, the vector was $\Gamma_3$. (A few of these cases are listed in Appendix A.) Firstly, this affirms that there are binary vectors that occur in practice. Next, if these are the only such vectors that occur in practice, then we have recovered $z_{1(30)}$, or the MSB of $k_0$, with probability 1 when the time taken to execute (2) is at its minimum. This minimum time period would naturally be $T_{const} + T_{carry}$, with $T_{const}$ being the constant time component (i.e., the sum total of the execution times of the steps, of the Add()'s invoked for (2), that are independent of the respective $x$'s and $y$'s). With this, let us proceed to the second step of the initialization mode, viz.,
$$s_{16} = v + u \bmod (2^{31} - 1), \qquad (3)$$
where $u = W \gg 1$ (see Sect. 2). We shall now argue that there are significantly many cases where (3) does not involve an end carry generation.
Footnote 6: The probability distribution here is a priori.
Footnote 7: Here, one may choose to ignore negligible biases in the carry probabilities. For example, when two 32-bit words are added modulo $2^{32}$, the carry-in at the MSB position is likely to be 0 with a very small bias probability of $2^{-32}$. Bias probabilities of the carries generated in modular sums have been examined in several works [23, 17, 21, 20].
We performed Experiment 1 again, this time counting the frequency at which the MSB of the 31-bit $v$ took the value 0. The total number of such cases was 32840, translating to a probability of 0.5011. Therefore, $v_{(30)}$ appears to be uniformly distributed at random. The first value that $u$ takes after it is initialised is $W = (X_0 \oplus R_{1(ini)}) + R_{2(ini)} \bmod 2^{32}$, where $R_{1(ini)}$ and $R_{2(ini)}$ are the initial values of $R_1$ and $R_2$, respectively. From [8, Sect. 3.6.1], we infer that $R_{1(ini)} = 0$ and $R_{2(ini)} = 0$. Hence, $W = X_0$ and
$$u = X_0 \gg 1, \qquad (4)$$
and this is the value of $u$ that goes into step 2 of the first invocation of Algorithm 1. Since $k_{15(7)}$ is an unknown key bit, $u_{(30)}$ can be reasonably assumed to be uniformly distributed at random. Given this, even if the carry-in at bit position 30 were to be heavily biased towards 1, with 0.25 probability we would still have the carry-out to be 0. In summary, the minimum execution time of Algorithm 1 can reasonably be expected to be $T'_{const} + T_{carry}$, $T'_{const}$ being the constant time component, for at least 25% of the key-IV pairs. We shall now show two ways to measure the execution time of Algorithm 1 and, using it, recover key-dependent information.
Through cache measurements:
In [26], Zenner mentions a side-channel oracle ACT_KEYSETUP() that provides an asynchronous cache adversary with a list of all cache accesses made by KEYSETUP(), the key setup algorithm of HC-256, in chronological order. Similarly, we introduce an oracle ACT_Algorithm-3() that provides the adversary with a chronologically ordered list of all cache accesses made by Algorithm 3. Zenner does not mention in [26] whether or not such an ordered list normally contains the time instants of the cache accesses as well. We assume that the instants are contained in the list. This is a rather strong assumption because, in the absence of the oracle, the adversary must have considerable control over the CPU of the legitimate party in order to obtain the cache access times.
Given this assumption, the adversary scans through the list and calculates the time difference between the third and the fourth accesses of the S-box $S$. The first access to $S$ is when it is initialised. Before Algorithm 1 is invoked for the first time, the nonlinear function F is computed (see Algorithm 3). During this computation, $S$ is accessed twice (see the definition of F in Sect. 2). The next (i.e., the fourth) access of $S$ happens after a few constant-time operations (e.g., executing the BR layer, computing $W$) that follow the first invocation of Algorithm 1. Let the time taken to perform these operations be denoted by $T''_{const}$. Then, the aforesaid time difference between the third and the fourth cache accesses of $S$ provides the adversary with $T'_{const} + \lambda T_{carry} + T''_{const}$, $\lambda \in \{1, 2, 3\}$.
The adversary can easily measure $T_{carry}$, $T'_{const}$ and $T''_{const}$ by simulating with an arbitrarily chosen key-IV pair (in practice, quite a few pairs will be required for precision). Thereby, the adversary obtains the value of $\lambda$. When $\lambda = 1$, the adversary is able to recover the MSB of $k_0$ immediately with probability 1. Now, since Experiment 1 cannot be performed over all key-IV pairs, we assume that $\Gamma_1, \Gamma_2, \ldots, \Gamma_7$ are equally likely to occur in practice. Under this assumption, $Pr(k_{0(7)} = 0)$ falls to 6/7 = 0.8571. This probability is further reduced to 5/7 = 0.7143 if we are to additionally have $c_1 = 0$.
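The step from a measured time difference to $\lambda$ amounts to one subtraction and one division. A minimal sketch, with a hypothetical function name and with all times expressed in the same (arbitrary) unit, is given below; it assumes that the three calibration values have already been measured and that the measurement noise is small compared with $T_{carry}$.

```c
/* Hypothetical helper: number of end carries implied by the gap t_gap
   between the third and fourth S-box accesses, given the calibrated
   constants t_const1 (T'_const), t_const2 (T''_const) and t_carry. */
int estimate_lambda(double t_gap, double t_const1, double t_const2, double t_carry)
{
    double excess = (t_gap - t_const1 - t_const2) / t_carry;
    return (int)(excess + 0.5);   /* round to the nearest integer */
}
```

For example, with $T'_{const} = 100$, $T''_{const} = 50$ and $T_{carry} = 7$, a measured gap of 171 corresponds to $\lambda = 3$, whereas a gap close to 157 corresponds to $\lambda = 1$, the case in which the MSB of $k_0$ is recovered.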
The timing analysis above assumes that $S$ is in cache. This is a very realistic assumption for the following reason. In [8, Appendix A], the S-box $S$ is implemented using two 8 × 8 lookup tables, viz., S0 and S1. Encryption performed many times on a single CPU would ideally result in the elements of these tables being accessed frequently. Moreover, every element of S0 and S1 can be expected to be accessed frequently if each encryption, in turn, invokes Algorithm 5 multiple times (i.e., a long keystream is generated). This would ideally place the lookup tables in the cache.
Using statistical methods that do not involve any cache measurement:
The execution time of Algorithm 1 can also be estimated without performing cache measurements. Let us recall that Algorithm 1 is run 32 times during the initialization process (see Algorithm 3). Following this, Algorithm 2 is run once (along with constant time steps of Algorithm 4 and Algorithm 5) before the first 32-bit keystream word is output (see Algorithms 4 and 5). Now, the first step of Algorithm 2 is identical to the first step of Algorithm 1. The subsequent steps of Algorithm 2 are constant time operations.
Thereby (see footnote 8), the total execution time till the first keystream word is generated is
$$T'''_{const} + T_{carry} \sum_{j=0}^{31} \lambda_j + T^{(w)}_{const} + \lambda^{(w)}_1 T_{carry}, \qquad (5)$$
where
1. $T'''_{const}$ is the sum total of $32 \cdot T'_{const}$ and the constant-time steps of Algorithm 3;
2. $\lambda_j$, $j \in \{0, 1, \ldots, 31\}$, is the number of times an end carry is generated in the $(j + 1)$th iteration of Algorithm 1;
3. $T^{(w)}_{const}$ is the sum total of the execution times of the constant time steps of Algorithm 2, plus the time to compute steps 1-3 of Algorithm 4 and steps 2-4 of Algorithm 5;
4. $\lambda^{(w)}_1$ is the $\lambda$ of the first run of Algorithm 2;
5. $\lambda_j, \lambda^{(w)}_1 \in \{0, 1, \ldots, 5\}$, $\forall j \in \{0, 1, \ldots, 31\}$.
Footnote 8: Throughout this paper, we ignore steps 3 and 4 of Algorithm 1 (and, naturally, steps 2 and 3 of Algorithm 2) because the event $s_{16} = 0$ occurs randomly with probability $2^{-31}$, which is negligible when compared to the probability that an end carry is generated exactly once. Besides, step 4 of Algorithm 1 is just an assignment operation and consumes only a small fraction of the time it takes to perform one 32-bit AND and one 32-bit addition. Therefore, we can safely assume that steps 3 and 4 of Algorithm 1 have negligible influence on the timing analysis.
Let us now try to estimate the mean of the $\lambda$'s assuming the z-terms are uniformly distributed from iteration 17 of Algorithm 1 onwards. This assumption is very reasonable at iteration 17, when every LFSR element has been updated once, and in the subsequent iterations. We performed Experiment 2 to determine the mean. We obtained the frequencies 12, 220, 792, 792, 220 and 12 for $\lambda$ = 0, 1, 2, 3, 4 and 5, respectively.
From these frequencies we obtain that the mean $\lambda$ is
$$\bar{\lambda} = \frac{0 \cdot 12 + 1 \cdot 220 + 2 \cdot 792 + 3 \cdot 792 + 4 \cdot 220 + 5 \cdot 12}{2048} = 2.5. \qquad (6)$$
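The mean follows directly from the six frequencies, taken here to correspond to $\lambda = 0, 1, \ldots, 5$. The short C sketch below (the function name is ours) makes the arithmetic explicit: the frequencies sum to 2048 and the weighted sum is 5120.

```c
/* Mean of lambda given freq[l] = number of samples with lambda = l,
   for l = 0, ..., n - 1. */
double mean_lambda(const unsigned freq[], int n)
{
    unsigned long total = 0, weighted = 0;
    for (int l = 0; l < n; l++) {
        total += freq[l];
        weighted += (unsigned long)l * freq[l];
    }
    return (double)weighted / (double)total;
}
```

Called with the frequencies 12, 220, 792, 792, 220 and 12, the function returns 5120/2048 = 2.5; the symmetry of the frequencies about 2.5 gives the same value by inspection.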
For the iterations 17-32 of Algorithm 1 and iteration 1 of Algorithm 2, the expected cumulative $\lambda$ is $17 \cdot \bar{\lambda} = 42.5$. The expected cumulative $\lambda$ can be computed for iterations 2-16, but in these computations one needs to make certain assumptions. This is because, in any iteration before the 17th, at least one of the z-vectors is composed of bits loaded directly from the key, IV and the d-constants. Assuming that the incoming carries at bit position 30 are uniformly distributed can make the $\lambda$ calculations erroneous. One may instead resort to simulations, but even then would have to perform extrapolations. For example, if the IV is unknown, then in iteration 2, to determine [...] 62 ) time complexity, however, one can at best perform a partial simulation and extrapolate the result. This means that there is always an error in computing the expected $\lambda$ for each of the iterations 2-16. Hence, we can instead assume that the expected $\lambda$ is $\bar{\lambda}$ for each of these iterations. This is also error-prone, but we can construct an appropriate credible interval to mitigate the error. This is done as follows. First, upon performing Experiment 2 with more z- (and hence c-) bits and observing the resultant frequencies (i.e., similar to those corresponding to (6)), we will observe that $\lambda$ is near-normally distributed. Given this, we first choose a confidence level (say, $\alpha$) and construct a credible interval around $\bar{\lambda}$. To reduce the error in assuming that the $\lambda$'s of iterations 2-16 are also near-normally distributed, we widen the credible interval corresponding to $\alpha$ while maintaining that the confidence level is $\alpha$.
Let $\lambda_{min}$ and $\lambda_{max}$ denote the lower and upper limits of the resulting credible interval around $\bar{\lambda}$. Now, let us suppose that the attacker clocks the encryption up to the generation of the first keystream word. If this duration falls within the interval (see (5)):
[...]
then the attacker concludes that the $\lambda$ for iteration 1 of Algorithm 1 is 1 (just like $T'_{const}$ and $T_{carry}$, the attacker can measure $T^{(w)}_{const}$). When this is the case, the attacker concludes that $k_{0(7)} = 0$ and $c_1 = 0$ with probability 5/7.
□
Given that $k_{0(7)}$ and $c_1$ are recovered, using [...], we arrive at Theorem 1.
Theorem 1. When $c_1 = 0$ and $k_{0(7)} = 0$, we have: [...]
with the '+' symbol denoting standard integer addition. [...] Now, we know that [...]
where the '+' denotes standard integer addition. Solving the recurrence equation (9), we arrive at (8). □
Complexities and Success Probabilities
The cache attack requires a few cache timing measurements for precision. If the S-boxes S0 and S1 are not in the cache, then Eve performs a few encryptions, using key-IV pairs of her choice, until the instant when Bob starts encrypting. We recall from Sect. 2 that the S-boxes are accessed twice in every iteration of Algorithm 5. From [8, Appendix A], we infer that 4 elements of S0 and S1 are used in every iteration of Algorithm 5. In the initialization mode, we have 32 similar iterations where F is computed and, hence, S0 and S1 are accessed. Let $\eta$ denote the number of iterations of Algorithm 5. Then, the total number of iterations per key-IV pair is $32 + 1 + \eta = 33 + \eta$ (this includes one iteration of Algorithm 4). This translates to a total of $2 \cdot (33 + \eta)$ (= $\eta'$, say) draws of elements from each of S0 and S1. Assuming that the draws are uniform and independent, the probability that every 8-bit S-box element appears at least $\theta$ times in the list of draws is given by:
where $\theta$ is the number of quickly successive RAM-fetches after which the concerned memory element is placed in the cache. The problem now is to find the smallest $\eta$ such that the probability given by (10) is reasonably close to 1. We are not aware of any simple method to solve this problem. However, when $\eta' = 256 \cdot \theta$, one expects that every element appears $\theta$ times in the list of uniform and independent draws. Given this, $\eta = 128 \cdot \theta - 33$. Therefore, the attack requires $128 \cdot \theta - 33$ keystream words to be generated with one key-IV pair. The time cost is $(128 \cdot \theta - 33) \cdot T_{KGA} + T_{ini}$, where $T_{KGA}$ is the execution time of one iteration of the keystream generating algorithm (i.e., Algorithm 5) and $T_{ini}$ is the initialization time. Alternatively, the attack can be performed with many key-IV pairs with each generating fewer keystream words. The time complexity in this case will obviously be higher than $(128 \cdot \theta - 33) \cdot T_{KGA} + T_{ini}$. But since the attacker does not require the keystream words for the attack (so it is an asynchronous attack even in the stricter viewpoint of Osvik et al. [18]), the data complexity is irrelevant here. Hence, we choose one key-IV pair and mount the attack in order to minimise its time complexity.
As an example, when $\theta = 100$, the pre-computation phase of the single-(key, IV) attack is expected to require $2^{13.64} \cdot T_{KGA} + T_{ini}$ time. We believe that, in practice, $\theta$ is such that the time complexity is not significantly larger than that for $\theta = 100$. Besides, if the S-boxes are already in the cache, key recovery is almost immediate.
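The figures above can be checked with a few lines of C. The sketch below is ours: keystream_words_needed evaluates $128 \cdot \theta - 33$, and coverage_probability is a simple Monte Carlo estimate (not a closed form) of the probability that each of the 256 table entries is drawn at least $\theta$ times in a given number of uniform draws. The use of rand() as the source of randomness is an assumption made for brevity; a better generator would be advisable for serious estimates.

```c
#include <stdlib.h>

/* Keystream words eta such that eta' = 2 * (33 + eta) equals 256 * theta. */
unsigned long keystream_words_needed(unsigned long theta)
{
    return 128UL * theta - 33UL;
}

/* Monte Carlo estimate of the probability that each of the 256 table
   entries is drawn at least theta times in `draws` uniform draws.
   trials must be positive. */
double coverage_probability(unsigned long draws, unsigned theta, unsigned trials)
{
    unsigned hits = 0;
    for (unsigned t = 0; t < trials; t++) {
        unsigned long count[256] = {0};
        for (unsigned long i = 0; i < draws; i++)
            count[rand() & 0xFF]++;      /* low byte of rand() as the table index */
        int ok = 1;
        for (int v = 0; v < 256; v++)
            if (count[v] < theta) { ok = 0; break; }
        hits += ok;
    }
    return (double)hits / (double)trials;
}
```

For $\theta = 100$ the first function gives 12767 keystream words, which is approximately $2^{13.64}$. The second function can be used to explore how far beyond $256 \cdot \theta$ draws one has to go before the estimated probability approaches 1.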
For the statistical timing attack, when the IV is unknown, the attack requires one 32-bit keystream word and the time needed to generate it. The success probability is less than 5/7 because of the errors caused by the approximations involved in the attack. While it seems extremely tedious to accurately compute the error, its magnitude can intuitively be made negligible by choosing a wide credible interval as stated earlier.
Implications of the Attacks to 128-EEA3
The 3GPP encryption algorithm 128-EEA3 is also a stream cipher that is built around ZUC [7] . It uses a 128-bit "confidentiality key" (denoted in [7] as CK) and encrypts data in blocks of size ranging from 1 bit to 20 kbits. Aside from the ZUC algorithm, the 128-EEA3 contains the following main steps.
Key initialization: The confidentiality key CK initialises the ZUC key in a straightforward manner as follows [7] .
Let $CK = CK_0 \| CK_1 \| \ldots \| CK_{15}$, where each $CK_i$ is a byte. Then,
$$k_i = CK_i, \quad i \in \{0, 1, \ldots, 15\}. \qquad (11)$$
IV initialization: The IV of ZUC is initialised using three parameters of 128-EEA3, viz., COUNT, BEARER and DIRECTION. The parameter COUNT ($= COUNT_0 \| COUNT_1 \| COUNT_2 \| COUNT_3$, where each $COUNT_i$ is a byte) is a counter, BEARER is a 5-bit "bearer identity" token and DIRECTION is a single bit that indicates the direction of message transmission [7]. Given these, the IV of ZUC is initialised as:
$$iv_i = COUNT_i \ (0 \le i \le 3), \quad iv_4 = BEARER \,\|\, DIRECTION \,\|\, 00_2, \quad iv_5 = iv_6 = iv_7 = 0, \quad iv_{8+i} = iv_i \ (0 \le i \le 7).$$
From (11), it trivially follows that the timing attacks on ZUC are also attacks on the 128-EEA3, with the corresponding bits of the confidentiality key CK being (partially) recovered. In other words, if bit $k_{i(j)}$ of the ZUC key is recovered then the bit $CK_{i(j)}$ of the 128-EEA3's confidentiality key is recovered as well.
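The key and IV initialization of 128-EEA3 can be sketched in C as below. The function name eea3_init and its signature are ours. The IV layout follows the published 128-EEA3 specification [7] as we understand it (a 32-bit COUNT taken most significant byte first, then the byte BEARER || DIRECTION || 00, three zero bytes, and the same eight bytes repeated) and should be checked against [7] before being relied upon.

```c
#include <stdint.h>

/* Hypothetical helper: derives the ZUC key and IV from the 128-EEA3 inputs. */
void eea3_init(const uint8_t ck[16], uint32_t count, uint8_t bearer,
               uint8_t direction, uint8_t key[16], uint8_t iv[16])
{
    for (int i = 0; i < 16; i++)
        key[i] = ck[i];                         /* k_i = CK_i */

    iv[0] = (uint8_t)(count >> 24);             /* COUNT, most significant byte first */
    iv[1] = (uint8_t)(count >> 16);
    iv[2] = (uint8_t)(count >> 8);
    iv[3] = (uint8_t)count;
    iv[4] = (uint8_t)(((bearer & 0x1F) << 3)    /* 5-bit BEARER */
                    | ((direction & 0x01) << 2)); /* 1-bit DIRECTION, then two zero bits */
    iv[5] = iv[6] = iv[7] = 0;
    for (int i = 0; i < 8; i++)
        iv[8 + i] = iv[i];                      /* second half repeats the first */
}
```

Because the key bytes are copied unchanged, the most significant bit of $CK_0$ is exactly the bit $k_{0(7)}$ targeted by the attacks.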
Countermeasures
In the previous sections, we described timing weaknesses that are mainly attributable to the design/implementation flaws listed in Observations 1 and 2. Consequently, we see the following countermeasures for the attacks that stem from these weaknesses:
1. A constant-time implementation of the modulo $(2^{31} - 1)$ addition in software and hardware.
2. A more involved key loading procedure.
Table 3 compares and contrasts the two countermeasures.
Of course, a conservative approach would be to complicate the key loading procedure as well as implement the modulo (2 31 − 1) addition as a constant-time operation.
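As an illustration of the first countermeasure, the modular addition can be written without a data-dependent branch by always adding the end carry back. The sketch below is ours and is only one possible form; whether it actually runs in constant time depends on the compiler and on the target processor, so the generated code should be inspected on the platform of interest.

```c
#include <stdint.h>

/* Branch-free addition modulo 2^31 - 1 of two 31-bit values:
   the end carry (bit 31 of the sum) is added back unconditionally. */
uint32_t add_m31_branchless(uint32_t x, uint32_t y)
{
    uint32_t z = x + y;
    return (z & 0x7FFFFFFFu) + (z >> 31);
}
```

The same AND, shift and addition are executed whether the end carry is 0 or 1, so the quantity $T_{carry}$ no longer appears as a separate, observable component of the running time.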
Table 3. Comparison of the two countermeasures
A more involved key loading procedure | A constant-time implementation of the modulo $(2^{31} - 1)$ addition
[...] | At times, like in the case of ZUC, it may be easier to find a safe implementation (this point will become evident from the discussion to follow in this section)
Affects the performance only if the key is changed frequently | Can affect the performance even if rekeying is rare
The timing weaknesses of this paper still remain but cannot be exploited to recover the key or key-dependent information | The timing weaknesses of this paper are removed; any other similar timing analysis, however, poses a risk of a straightforward (partial) key recovery

For the key loading procedure, we suggest the following alternatives:
1. Applying a secure hash function to the $s_i$'s of (1) [...] to the S-boxes (call them $B_i$, $i \in \{0, 1, \ldots, 15\}$) are the $s_i$'s of (1). When the S-boxes are all secret, N = 31 can suffice even though at least 15 input bits are known constants. This is because (i) the S-boxes are secret, and (ii) S-boxes with outputs larger than inputs can still accomplish Shannon's confusion [22] (note that Shannon's diffusion, as interpreted by Massey in [15], does not apply to stream ciphers) [1].
Recall that the timing attacks of Sect. 4 can recover only one bit of $B_0(s_0)$ and some information on 6 other bits. While these may be improved in the future (directions for this are provided in Sect. 6) to possibly recover more key bits, recovering an entire 31-bit block seems far-fetched. Actually, with the use of secret S-boxes it is no longer possible, in the first place, to perform the exact same analysis as in Sect. 4. This is because we will have unknown bits in place of the $d_{i(j)}$'s that constitute the MSBs of $z_1, z_2, \ldots, z_6$ (see Sect. 4). Therefore, even upon making precise timing measurements, the attacker will very likely have to guess the bits in place of the $d_{i(j)}$'s before trying to determine the bits in the LFSR. The attacker can, given precise timing measurements, find the number of 0's in $[z_{1(30)}\ z_{2(30)}\ \ldots\ z_{6(30)}]$, but is unlikely to be able to ascertain which bits are 0's. For example, the six z-bits being uniformly distributed (given that the S-boxes are secret) and the carries into the bit position 30 being distributed close to uniformly (see footnote 4), there is about $2^{-7.42}$ probability that there is no end carry.

10 There are, however, preimage attacks on step-reduced SHA-512 (see e.g. [2, 13]). The best of these, due to Aoki et al. [2], works on 46 steps (out of the total 80), has a time complexity of $2^{511.5}$ and requires a memory of about $2^{6}$ words.
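To make the ambiguity mentioned above concrete, suppose that the timing measurements reveal only the number $k$ of 0's among the six MSBs $z_{1(30)}, \ldots, z_{6(30)}$. The number of bit patterns that remain consistent with such a measurement is then a binomial coefficient. This small calculation is our own illustration and is not part of the analysis of Sect. 4.

```latex
\[
  \#\{\text{candidate patterns for a given } k\} = \binom{6}{k},
  \qquad
  \binom{6}{0}, \binom{6}{1}, \ldots, \binom{6}{6} = 1,\ 6,\ 15,\ 20,\ 15,\ 6,\ 1 .
\]
```

Only the extreme counts $k = 0$ and $k = 6$ determine the pattern uniquely; for $k = 3$, for instance, 20 candidate patterns remain.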
Placing entire lookup tables in the cache, by the legitimate party, prior to encryption (a process known as cache warming) is suggested as a countermeasure in [19]. Let us suppose that our lookup tables fit completely into the cache. Further let us assume that the adversary's instructions or system processes do not evict the contents of these tables. Then, by placing the tables in the cache, one ensures that all S-box accesses are cache hits. Cache attacks that exploit the difference in the register loading time between a cache hit and a miss are precluded by this process. As we do not mount such cache attacks in this paper, cache warming is not a useful countermeasure here.

Nomenclature: To facilitate future reference, we label some of the above secure modifications of ZUC in Appendix B.
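For readers unfamiliar with cache warming as discussed above, the idea can be sketched in C as follows. This is only an illustration under our own assumptions, and, as noted, it does not counter the attacks of this paper. The helper name warm_table and the 64-byte cache-line size are hypothetical; the sketch also assumes that the table fits in the cache and is not evicted afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; the real value is platform-specific. */
#define CACHE_LINE_BYTES 64u

/*
 * Hypothetical helper: read one byte from every cache line spanned by the
 * lookup table, so that the lines are loaded before encryption starts.
 * The volatile qualifier keeps the compiler from removing the reads.
 * The returned sum of the bytes read is only a by-product of the reads.
 */
uint32_t warm_table(const volatile uint8_t *table, size_t len)
{
    uint32_t sink = 0;
    size_t i;

    assert(table != NULL || len == 0);

    /* Touch the table once per assumed cache line. */
    for (i = 0; i < len; i += CACHE_LINE_BYTES) {
        sink += table[i];
    }
    /* The last byte may sit in a further line if the table is unaligned. */
    if (len > 0) {
        sink += table[len - 1];
    }
    return sink;
}
```

A caller would invoke warm_table() on each S-box table immediately before encryption. Whether the tables then stay cached depends on the platform and on other running processes.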
Conclusions
In this paper, we have presented timing attacks on the stream cipher ZUC that recover, under certain practical circumstances, one key bit along with some key-dependent information with about 0.7 success probability and negligible time, memory and data. To the best of our knowledge, these are the first attacks on the ZUC cipher of Version 1.5. The following are other highlights of this paper.
- This is one of the very few and early papers analysing the cache timing resistance of stream ciphers. As noted in [14], block ciphers (mainly the AES) have been the prominent targets of cache timing attacks. Besides, cache timing analyses of stream ciphers are recent additions to the cryptanalysis literature, with the first paper (viz., [26]) being published as late as 2008 [14].
- The statistical timing attack is novel, to the best of our knowledge.
- The timing attacks of this paper warn that algorithms must be designed or implemented to resist single-round/iteration timing weaknesses. This single round can even belong to the key/IV setup of the cipher.
The weaknesses we have found that lead us to the attacks may be certificational. Nonetheless, we see a possibility for improving the attacks to recover a few other key bits by, for example, examining the cases where end carry is generated twice.
We have also proposed modifications to ZUC that resist not only the initiatory timing attacks but, evidently, also their potential improvements suggested above. The analysis of these new schemes appears to us to be an interesting problem for future research.
Update. The timing analyses and countermeasures presented above were privately communicated to the ETSI/SAGE before the 2nd International Workshop on ZUC Algorithm and Related Topics, and consequently the software implementation of ZUC was modified [10, Sect. 12.9]. The latest C code (see Version 1.6 [9, Appendix A]) implements the AddC() subroutine that we suggested in Sect. 5, in place of the variable-time Add(). The ZUC specification with this revised C code has been approved by the 3GPP for inclusion in the LTE standards (see [10, Sect. 12.9] and [11]).
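The difference between a variable-time and a constant-time addition modulo $2^{31} - 1$, as referred to above, can be sketched in C as follows. This is a minimal illustration under our own assumptions. The names add_vt, add_ct and check_add_equivalence are hypothetical, the operands are assumed to lie in $[1, 2^{31} - 1]$, and the function bodies are not copied from the ETSI/SAGE code for Add() or AddC().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASK31 0x7FFFFFFFu

/* Variable-time sketch: the extra work is done only when an end carry
 * occurs, so the running time can depend on the (secret) operands. */
uint32_t add_vt(uint32_t a, uint32_t b)
{
    uint32_t c = a + b;
    if (c & 0x80000000u) {
        c = (c & MASK31) + 1u;
    }
    return c;
}

/* Constant-time sketch: the end carry (bit 31) is folded back into the
 * sum without a data-dependent branch at the source level. */
uint32_t add_ct(uint32_t a, uint32_t b)
{
    uint32_t c = a + b;
    return (c & MASK31) + (c >> 31);
}

/* Checks that both sketches agree on a few sample operands. */
void check_add_equivalence(void)
{
    static const uint32_t samples[] = {
        1u, 2u, 0x3FFFFFFFu, 0x40000000u, 0x7FFFFFFEu, MASK31
    };
    const size_t n = sizeof samples / sizeof samples[0];
    size_t i, j;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            assert(add_vt(samples[i], samples[j]) ==
                   add_ct(samples[i], samples[j]));
        }
    }
}
```

Both functions return the same values for operands in the assumed range; only the way in which the end carry is handled differs. Whether the compiled code is free of data-dependent branches should still be checked on the target platform.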
A Practical Occurrences of $\Gamma_3$, $\Gamma_5$ and $\Gamma_6$

Table 4 provides some example key-IV values that produce the favourable $\Gamma$-vectors (i.e., the vectors that generate end carry exactly once).
B ZUC Modifications
We list our proposed algorithm/implementation modifications in Table 5.

Table 5. ZUC modifications: To each label we suffix a '+' if one or more of the generic countermeasures suggested by Osvik et al. in [18] are applied

| Label | Reference |
|---|---|
| ZUC-1.5C | Constant-time software implementation of modulo $(2^{31} - 1)$ addition (i.e., implementation of Version 1.5 with AddC() replacing the variable-time Add()) |
| ZUC-1.5H | Involved key loading: hash function |
| ZUC-1.5S | Involved key loading: S-boxes |
| ZUC-1.5CH | Constant-time implementation of modulo $(2^{31} - 1)$ addition along with involved key loading using a hash function |
| ZUC-1.5CS | Constant-time implementation of modulo $(2^{31} - 1)$ addition along with involved key loading using S-boxes |
