The demand for more efficient ciphers is likely to sharpen with new generations of products and applications. Previous cipher designs typically focused on optimizing only one of the two parameters, hardware size or speed, for a given security level. In this paper, we present a methodology for designing a class of stream ciphers which takes into account both parameters simultaneously. We combine the advantage of the Galois configuration of NLFSRs, short propagation delay, with the advantage of the Fibonacci configuration of NLFSRs, which can be analyzed formally. According to our analysis, the presented stream cipher Espresso is the fastest among the ciphers below 1500 GE, including Grain-128 and Trivium.
Introduction
The importance of designing efficient and secure cryptographic systems is hard to overestimate. On one hand, with the growth of Internet-of-Things applications, more everyday-life products become security-critical and require high levels of assurance. On the other hand, these products typically have very limited resources available for the implementation of security mechanisms. In addition, the required computational effort and data rates are expected to significantly increase with new generations of products. 5G is envisioned to have 1000 times higher traffic volume compared to current LTE deployments while providing a better quality of service [1]. Consumer data rates of hundreds of Mbps are expected to be available in a general scenario and multi-Gbps in specific scenarios [2]. Furthermore, 5G needs to support a low latency of a few milliseconds to address use cases such as safety or control mechanisms in the process industry, in the electrical-distribution grid, or for traffic safety [2].
To design a secure cipher which satisfies the requirements of the most demanding products and applications, we have to find the best trade-off between hardware size and speed for a given security level. Previous stream cipher designs either have too high a propagation delay (e.g. Grain [3]) or use too many flip-flops (e.g. Trivium [4]) for a given security level. Thus, they optimize only one of the two important parameters, hardware size or speed. In this paper, we present a methodology for designing a class of stream ciphers which takes into account both parameters simultaneously, thus minimizing the hardware footprint and maximizing the throughput of the design. We combine the advantage of the Galois configuration of NLFSRs, short propagation delay, with the advantage of the Fibonacci configuration of NLFSRs, which can be more easily analyzed formally. A careful choice of taps for the output Boolean function allows us to perform a security analysis for linear approximation attacks. According to our evaluation, the presented stream cipher is the fastest among the ciphers below 1500 GE, including Grain-128 and Trivium.
The paper is organized as follows. Section 2 gives basic notation used in the sequel. Section 3 describes previous work. Section 4 presents the new stream cipher Espresso. Section 5 analyses its hardware cost. Section 6 presents the security analysis. Section 7 concludes the paper.
Preliminaries
Throughout the paper, we use "⊕" and "·" to denote addition and multiplication in GF(2), respectively.
The Boolean functions $GF(2^n) \to GF(2)$ are represented using the Algebraic Normal Form (ANF), which is a polynomial over GF(2) [5].
An n-bit Feedback Shift Register (FSR) consists of n binary storage elements, called stages. Each stage i ∈ {0, 1, ..., n − 1} has an associated state variable $x_i$ which represents the current value of the stage i and a feedback function $f_i : GF(2^n) \to GF(2)$ which determines how the value of stage i is updated.
A state of an FSR is a vector of values of its state variables. At each clock cycle, the next state of an FSR is determined from its current state by simultaneously updating the value of each stage i to the value of the corresponding feedback function $f_i$, for all i ∈ {0, 1, ..., n − 1}.
The period of an FSR is the length of the longest cyclic output sequence it produces. If all feedback functions of an FSR are linear, then it is called a Linear Feedback Shift Register (LFSR). Otherwise, it is called a Non-Linear Feedback Shift Register (NLFSR).
An FSR can be implemented either in the Fibonacci or in the Galois configuration [6]. In the former, the feedback is applied to the input stage of the shift register only. All remaining feedback functions are of type $f_i = x_{i+1}$, for i ∈ {0, 1, ..., n − 2}. In the latter, the feedback can potentially be applied to every stage. Thus, the Fibonacci configuration is a special case of the Galois configuration. Due to its conceptual simplicity, the Fibonacci configuration has been studied much more thoroughly.
Two NLFSRs are called equivalent if their sets of output sequences are equal.
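The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `step`, `fibonacci_fsr` and `cycle_length`, as well as the 4-bit example LFSR, are introduced here and are not part of the cipher.

```python
def step(state, feedback):
    """One clock cycle: every stage i is updated simultaneously to f_i(state)."""
    return tuple(f(state) for f in feedback)


def fibonacci_fsr(n, f_input):
    """Fibonacci configuration: f_i = x_{i+1} for i < n-1; only stage n-1 gets feedback."""
    shifts = [lambda x, i=i: x[i + 1] for i in range(n - 1)]
    return shifts + [f_input]


def cycle_length(state, feedback):
    """Number of clocks until `state` recurs (assumes the state lies on a cycle)."""
    current, steps = step(state, feedback), 1
    while current != state:
        current, steps = step(current, feedback), steps + 1
    return steps


# Hypothetical 4-bit LFSR with f_3 = x_0 XOR x_1 (illustrative only).
lfsr = fibonacci_fsr(4, lambda x: x[0] ^ x[1])
period = cycle_length((1, 0, 0, 0), lfsr)
```

For this toy register `period` evaluates to 15, i.e. $2^4 - 1$. A Galois configuration would be modelled by passing a list in which several stages, not only the last one, have non-trivial feedback functions.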
Previous work
For encryption purposes, there are two types of ciphers, namely block and stream ciphers. Block ciphers have been studied for over 50 years [7]. Collected knowledge about their design and cryptanalysis made it possible to develop the Advanced Encryption Standard (AES) algorithm, which is widely accepted and has strong resistance against various kinds of attacks [8].
On the other hand, an active public investigation of stream ciphers began only about 20 years ago [9] . A common type of stream cipher is the binary additive stream cipher, in which the keystream, the plaintext, and the ciphertext are binary sequences. The keystream is produced by a keystream generator which takes a secret key and an initial value (IV) as a seed and generates a long pseudo-random sequence of 0s and 1s. The ciphertext is then obtained by the bit-wise addition of the keystream and the plaintext.
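The binary additive construction can be illustrated with a short Python sketch. The keystream below is a fixed placeholder, not the output of any real keystream generator, and the function name `additive_stream` is an assumption made for this example.

```python
def additive_stream(keystream_bits, data_bits):
    """Bit-wise addition over GF(2) (XOR) of a keystream and a data sequence."""
    return [k ^ d for k, d in zip(keystream_bits, data_bits)]


keystream = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder bits, not from a real cipher
plaintext = [0, 1, 1, 0, 1, 0, 0, 1]

ciphertext = additive_stream(keystream, plaintext)
recovered = additive_stream(keystream, ciphertext)
```

Since addition in GF(2) is its own inverse, decryption is the same operation as encryption, and `recovered` equals `plaintext`.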
In the eSTREAM initiative, many stream ciphers, including Grain [3] and Trivium [4], were designed following the belief that stream ciphers can be made both faster and smaller than block ciphers. In recent years, however, several block ciphers have been presented which are comparable in size to Grain and Trivium. Some well-known examples include KATAN [10], LED [11], KLEIN [12], PRESENT [13], Piccolo [14] and TWINE [15]. The throughput for these is often given for a 100 kHz clock frequency since this is typical for RFID tags [16]. Yet, they can often be clocked faster, and [17] reports some implementations reaching about 1 Gbps using slightly more than 3000 GE and 90 nm CMOS technology. For higher throughput and a more compact design, it appears that stream ciphers are the best choice.
However, the confidence in stream ciphers' security has been tempered by many broken systems. For example, the popular stream ciphers A5/1 and A5/2 used in the Global System for Mobile communications (GSM) standard and E0 used in Bluetooth have been found susceptible to a number of attacks [18]. As a result, A5/1 was replaced by the block-cipher-based A5/3 and A5/2 was prohibited. Another well-known stream cipher, RC4, used in the original IEEE 802.11 standard to secure wireless networks, has been shown to be especially vulnerable when the beginning of the output keystream is not discarded, non-random or related keys are used, or a single keystream is used twice [19]. As a result, it was replaced by the AES in the newest standard, IEEE 802.11i.
The confidence in stream ciphers is typically higher and their acceptance is faster if they are built from well-defined components whose security can be formally analyzed. One of the most studied components for stream ciphers is a filter generator, which consists of an FSR [20] and a nonlinear output function taking its inputs from selected stages of the FSR. It is known how to make design choices for the size of the internal state and the output function (number and position of inputs, nonlinearity, resiliency, algebraic degree, etc.) so that the resulting filter generator is resistant to known attacks with a sufficient security margin [21] [22] [23]. Techniques that can guarantee a long FSR period are also known [24, 25].
Examples of stream ciphers based on the idea behind filter generators include Grain [3] and Trivium [4] .
Design description
This section motivates and describes the presented stream cipher Espresso.
Design methodology
The Grain family of stream ciphers [3] uses FSRs in the Fibonacci configuration. This adds simplicity to the security analysis, but has a drawback that the propagation delay through the feedback function is large due to the large size of the function. Trivium [4] uses much simpler feedback functions, but it has 288 flip-flops which are more area consuming compared to gates.
First, by using FSRs implemented in the Galois configuration, we can make the feedback functions smaller. This allows us to reduce the propagation delay compared to Grain while at the same time decreasing the size compared to Trivium. Due to the large number of feedback functions in the presented design, its maximum degree of parallelization cannot be made as high as in Grain or Trivium. Still, by carefully choosing feedback functions, we are able to guarantee a maximum degree of parallelization of 4 and a maximum-length FSR.
Second, to enable security analysis of the presented design, we transform the original Galois NLFSR to an NLFSR whose configuration resembles the Fibonacci configuration. The core idea of our method is to assure that all of the most biased linear approximations of the output Boolean function take inputs only from those stages of the Galois NLFSR which have a corresponding equivalent stage in the transformed NLFSR. As a result, traditional cryptanalysis techniques can be applied to our design as well.
Design details
The two main building blocks of Espresso are a 256-bit NLFSR G in the Galois configuration and a 20-variable nonlinear output function. To avoid confusion between the feedback functions of G and the feedback functions of the transformed NLFSR F introduced later, we denote a feedback function of the stage i of G by $g_i$, for all i ∈ {0, 1, ..., 255}.
The feedback functions of the NLFSR G are specified as follows:
All remaining feedback functions of G are of type
The output function z(x) is specified as follows:
The function z(x) consists of a linear function of 6 variables and a bent function of 14 variables. Therefore, z(x) is balanced, has nonlinearity $2^6(2^{13} - 2^6) = 520192$ and resiliency 5. The algebraic degree of z(x) is 6 since its largest ANF monomial contains 6 variables.
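The nonlinearity figure can be checked on a scaled-down function of the same shape. The sketch below is an illustration only: the toy function (a 4-variable bent function plus a 2-variable linear function) and the helper names are assumptions, and the brute-force Walsh transform is far too slow for the actual 20-variable z(x).

```python
from itertools import product


def nonlinearity(f, n):
    """Nonlinearity of f: {0,1}^n -> {0,1} via a brute-force Walsh transform."""
    points = list(product((0, 1), repeat=n))
    table = [f(x) for x in points]
    max_abs = 0
    for a in points:
        w = 0
        for x, fx in zip(points, table):
            dot = sum(ai & xi for ai, xi in zip(a, x)) & 1
            w += 1 if (fx ^ dot) == 0 else -1
        max_abs = max(max_abs, abs(w))
    return 2 ** (n - 1) - max_abs // 2


def toy(x):
    """Bent part x0*x1 + x2*x3 on 4 variables, linear part x4 + x5 on 2 variables."""
    return (x[0] & x[1]) ^ (x[2] & x[3]) ^ x[4] ^ x[5]


def predicted(k, m):
    """2^k * (2^(m-1) - 2^(m/2-1)): k linear variables added to an m-variable bent function."""
    return 2 ** k * (2 ** (m - 1) - 2 ** (m // 2 - 1))


toy_nl = nonlinearity(toy, 6)
```

Here `toy_nl` and `predicted(2, 4)` both evaluate to 24, and `predicted(6, 14)` reproduces the value 520192 stated for z(x).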
NLFSRs in the Fibonacci configuration have been studied and cryptanalyzed much more than NLFSRs in the Galois configuration. To make use of the accumulated knowledge, we can transform the NLFSR G into an equivalent NLFSR F whose configuration resembles the Fibonacci configuration (see Fig. 1) and therefore is easier to cryptanalyze. The NLFSR F has only two non-trivial feedback functions:
and all remaining feedback functions are of type $f_i(x) = x_{i+1}$. We can see that the function (1) which is induced by the function $f_L(x)$. It is known that NLFSRs constructed in this way have the period $2^n - 1$ where n is the size of the state [24].
The equivalence of G and F can be shown by applying the Fibonacci-to-Galois transformation [6]. The set of sequences generated by the stage 231 of G is equivalent to the set of sequences generated by the stage 255 of F. The set of sequences generated by the stage 193 of G is equivalent to the set of sequences generated by the stage 217 of F. Since G is equivalent to F, its period is $2^{256} - 1$.
The function $f_{255}(x)$ is balanced, has nonlinearity $2^6(2^{11} - 2^5) = 129024$, resiliency 5, and algebraic degree 4. Since F and G are equivalent, the function $g_{231}(x)$ of G has the same properties.
Indexes of variables of f L (x) form the full difference set 
Key and IV initialization
The cipher Espresso is initialized as follows. Let $k_i$ denote the bits of the key k, 0 ≤ i ≤ 127, and $IV_i$ denote the bits of the initialization value IV, 0 ≤ i ≤ 95. The key and IV bits are loaded into the shift register as follows:
The initialization phase consists of clocking the cipher 256 times, XORing the produced output bit with the stages $x_{255}$ and $x_{217}$. Thus, in this phase the feedback functions $g_{255}(x)$ and $g_{217}(x)$ of the NLFSR G are given by
After initialization, the cipher is clocked for three more cycles (due to the pipelining of the output function and additional logic required for switching between the initialization and the keystream generation phases, as explained in Section 5) and then the keystream is produced.
Hardware cost analysis
In order to reduce the propagation delay of the circuit implementing the output function z(x), we can pipeline it as follows:
A circuit diagram implementing the pipelined version of z(x) is shown in Fig. 2. As a consequence of the pipelining, the output of the stream cipher is delayed by two clock cycles, increasing the latency. In addition, the pipelining increases the area by 8 flip-flops. However, it allows us to increase the throughput 1.7 times. In our opinion, the substantial gain in throughput outweighs the minor increases in area and latency.
In order to further reduce the propagation delay of the presented design, we apply De Morgan's rule to re-express the ANFs of the feedback functions $g_{235}$ and $g_{197}$ of the NLFSR G as follows:
where $\bar{x}$ denotes the Boolean complement of x (defined as $\bar{x} = x \oplus 1$), and "+" denotes the Boolean OR. From Table 1, the reader can see that, in CMOS technology, NAND and NOR gates are much smaller and faster than AND gates. Therefore, we can decrease both the area and the delay by replacing a 4-input AND as shown above.
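The effect of De Morgan's rule on a 4-input AND can be checked exhaustively. The decomposition below (two 2-input NANDs feeding one 2-input NOR) is an assumption chosen to be consistent with the gate counts reported for G; it is not claimed to be the exact expression used in the design, and the function names are illustrative.

```python
from itertools import product


def nand(a, b):
    """2-input NAND on bits."""
    return 1 - (a & b)


def nor(a, b):
    """2-input NOR on bits."""
    return 1 - (a | b)


def and4_via_nand_nor(a, b, c, d):
    """a*b*c*d rewritten as NOR(NAND(a, b), NAND(c, d)) by De Morgan's rule."""
    return nor(nand(a, b), nand(c, d))


# Compare against a direct 4-input AND on all 16 input combinations.
all_equal = all(
    and4_via_nand_nor(a, b, c, d) == (a & b & c & d)
    for a, b, c, d in product((0, 1), repeat=4)
)
```

The flag `all_equal` is true, so the rewritten form computes the same function while using only NAND and NOR gates.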
Finally, we need to take care of the propagation delay of the feedback functions $g_{255}$ and $g_{217}$ during the initialization phase, in which the output of z(x) is XORed into both functions. Figure 3 shows how switching between the initialization and the keystream generation phases can be implemented for $g_{255}$ without increasing the critical path (a circuit for the function $g_{217}$ is similar). The output of z(x) needs to be multiplexed and pipelined. Note that while the function describing a regular 2-input multiplexer (MUX) is $a \cdot b + \bar{a} \cdot c$, a multiplexer in which one input is fixed to 0 (c = 0) can be implemented as $a \cdot b$.
Therefore, an AND gate can be used to implement the multiplexing of z(x), as shown in Fig. 3 . Since the delay of an AND is smaller than the delay of an XOR, the proposed switching scheme does not increase the overall delay. However, it increases the latency by one clock cycle.
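The multiplexer simplification can be confirmed with a few lines of Python. The function name `mux` and the convention that `a` is the select input are assumptions made for this sketch.

```python
def mux(a, b, c):
    """2-input multiplexer: returns b when the select bit a is 1, otherwise c."""
    return (a & b) | ((1 - a) & c)


# With the second data input tied to 0, the multiplexer reduces to a single AND.
reduces_to_and = all(
    mux(a, b, 0) == (a & b)
    for a in (0, 1)
    for b in (0, 1)
)
```

In the switching logic, `a` would play the role of the phase-select signal and `b` the output of z(x), so the output bit is fed back only during initialization.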
After these modifications, the NLFSR G requires 12 2-input ANDs, 4 2-input NANDs, 2 2-input NORs, 19 2-input XORs and 256 flip-flops to be implemented. The output function z(x) requires 8 2-input ANDs, 2 3-input ANDs, 13 2-input XORs and 8 flip-flops. The Table 1 for parameters of gates), then we can approximate the area and the propagation delay of the presented stream cipher Espresso as Area of (22 2 We assume that a logic similar to the one shown in Fig. 3 is used for switching between the initialization and the operational phases. Otherwise, the delay of Grain-128 is considerably higher. Grain-128 can be parallelized to produce up to 32 bits per clock cycle. For the degree of parallelization one, its latency is 296 ns (computed as (128+256+1) clock cycles × 768 ps). For the stream cipher Trivium [4] we have:
Area of (3 ANDs + 11 XORs + 288 flip-flops) = 5597 μm² = 1513 GE
Delay of (AND + 2 XORs + flip-flop) = 538 ps.
Trivium can be parallelized to produce up to 64 bits per clock cycle. For the degree of parallelization one, its latency is 663 ns (computed as (80 + 4 × 288) clock cycles × 538 ps). Note that, in all three ciphers, the latency can be reduced if the key and IV are loaded in parallel rather than sequentially. However, such a technique requires the addition of a MUX to each flip-flop of the FSRs, implying an increase in area of at least 3 GE × FSR size. Tables 2, 3 and 4 summarize the area and throughput of the three ciphers for the degrees of parallelization 1, 2 and 4. For the degree of parallelization 1, Espresso is 3.5% larger and 71% faster than Grain-128. Its latency is 22% smaller than that of Grain-128. Compared to Trivium, it is 1% smaller, 19% faster, and has 65% smaller latency. We can see that Espresso is the fastest among the designs below 1500 GE.
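The latency and speed comparisons above follow from simple arithmetic, sketched below. The sketch assumes that the clock period equals the critical-path delay and that one bit is produced per cycle; the Espresso throughput and latency are taken from the figures quoted in the Conclusion, and the helper names are illustrative.

```python
def latency_ns(clock_cycles, delay_ps):
    """Initialization latency in ns for a given cycle count and clock period."""
    return clock_cycles * delay_ps / 1000.0


def throughput_gbps(delay_ps, bits_per_cycle=1):
    """Throughput in Gbit/s, assuming the clock period equals the critical-path delay."""
    return bits_per_cycle * 1000.0 / delay_ps


# Figures quoted in the text.
grain_latency = latency_ns(128 + 256 + 1, 768)      # about 296 ns
trivium_latency = latency_ns(80 + 4 * 288, 538)     # about 663 ns
espresso_latency = 232.0                            # ns, from the Conclusion
espresso_throughput = 2.22                          # Gbit/s, from the Conclusion

speedup_vs_grain = espresso_throughput / throughput_gbps(768) - 1
speedup_vs_trivium = espresso_throughput / throughput_gbps(538) - 1
latency_gain_vs_grain = 1 - espresso_latency / grain_latency
latency_gain_vs_trivium = 1 - espresso_latency / trivium_latency
```

Under these assumptions the computed ratios agree with the quoted percentages (71%, 19%, 22% and 65%) to within rounding of the published figures.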
Security analysis
This section gives a security analysis of the presented stream cipher Espresso. Both attacks on the running keystream and attacks on the initialization procedure are discussed.
Linear approximations
Attacks using linear approximations were successful against the initial version of Grain, resulting in key recovery attacks. Being an NLFSR with a nonlinear output function, the current design has similarities with Grain. This makes it important to determine the resistance against these attacks. The security against linear attacks will be analyzed using the equivalent transformed configuration F of the shift register G. Note that there are no linear terms in any shift register stages that do not have an equivalent in both configurations, so the analysis is valid also for the Galois configuration. For clarity of the presentation, we divide the state register into two separate parts. The state variables in the nonlinear part (Shift Register 1 in Fig. 1) are denoted by b and the state variables in the linear part (Shift Register 2 in Fig. 1) are denoted by s. Furthermore, let B and S denote the size of the nonlinear part and the linear part of the shift register, respectively. Thus, we have Grain-128 which is induced by the polynomial (1) . Define the bias ε of an approximation as $\varepsilon = 2 \cdot \Pr(X = Y) - 1$, simply written as $X \stackrel{\varepsilon}{=} Y$. Then, the nonlinear output function can be approximated with a linear function and we can write
Denote the number of b-variables in the output function $w_b(z)$, i.e., 0
Similarly, the nonlinear feedback function can be approximated by a linear function in bits from s since there are no b-variables in the feedback in order for the nonlinear compensation to $s_{S-1}$ to work properly. Thus, we can also write
Combining (3) and (4), we can write the output as a sum of only variables from the linear part of the shift register,
where the piling-up lemma has been used to combine linear approximations. Thus, an output variable can always be written as a biased sum of s-variables, which in turn satisfy a linear recurrence relation. If we denote the weight of this recurrence relation by w(LR), we get a distinguishing attack with total bias
From this it is clear that the complexity of the attacks relies on the biases of the two approximations and on the number of b-variables that are used in the linear approximation of the output function. Looking at the design, we have $\varepsilon_1 = 2^{-7}$, $\varepsilon_2 = 2^{-6}$ and $w_b(z) = 6$ for all biased linear approximations. From this it follows that the approximation (6) has bias $2^{-43}$, which makes an attack similar to the one in [26] inefficient. Also, if we use a weight 3 multiple of the linear recurrence relation, the number of samples needed would be in the order of $1/\varepsilon_{tot}^2 = 2^{43 \cdot 3 \cdot 2} = 2^{172}$ (with distance $2^{218/2} = 2^{109}$ between first and last keystream bit in each sample [27, 28]).
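The piling-up lemma used above can be illustrated with exact rational arithmetic. With the bias defined as 2 · Pr(X = Y) − 1, the bias of an XOR of independent approximation errors is the product of the individual biases. In the sketch below, the function name `xor_bias` is an assumption, and the product form $\varepsilon_1 \cdot \varepsilon_2^{w_b(z)}$ is inferred from the stated values rather than taken from a displayed equation.

```python
from fractions import Fraction
from itertools import product


def xor_bias(biases):
    """Exact bias 2*Pr[XOR = 0] - 1 of the XOR of independent bits with given biases."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=len(biases)):
        p = Fraction(1)
        for bit, eps in zip(bits, biases):
            p0 = (1 + eps) / 2              # Pr[bit = 0] for bias eps
            p *= p0 if bit == 0 else 1 - p0
        if sum(bits) % 2 == 0:              # XOR of all bits equals 0
            total += p
    return 2 * total - 1


eps1 = Fraction(1, 2 ** 7)    # bias of the output-function approximation
eps2 = Fraction(1, 2 ** 6)    # bias of the feedback-function approximation
w_b = 6                       # number of b-variables in the approximation

combined = xor_bias([eps1] + [eps2] * w_b)
```

The value of `combined` is exactly $2^{-43}$, matching the bias stated for approximation (6).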
Algebraic attacks
Algebraic attacks have proved very efficient against nonlinear combiners with or without memory [29, 30]. The success of the attack is due to the linearity of the shift register and the fact that the output function is the only nonlinear part of the register. It is always possible to write equations describing output bits using initial state bits. Due to the linearity of the shift register, the algebraic degree of these equations will never exceed the degree of the output function. With enough equations, linearization, or other more advanced methods [31] [32] [33], can be used to recover the internal state. Moreover, annihilators [34] can be used to lower the degree of the functions even more. With a part of the state being nonlinearly updated, these attacks are no longer applicable since several nonlinear register stages are used in the output function. The degree of the equations in initial state bits will increase and is not limited by the degree of the output function.
Time-memory-data trade-off attacks
TMTO attacks on stream ciphers can be divided into two categories: those that attempt to reconstruct the internal state, see e.g. [35] [36] [37], and those that attempt to recover the key, see e.g. [38]. The algorithms used in the latter attacks are the same as those in the former; they just use a different one-way function as the target of the attack. The algorithm used in [35, 36] simply records input/output combinations and uses enough data in order to have a collision with a recorded value. The trade-off curve is given by $TM = N$, $T = D$, and $P = M = N/D$. The algorithm used in [37] instead creates tables similar to those used by Hellman in [39] and has the trade-off given by $TM^2D^2 = N^2$, $1 \le D^2 \le T$ and $P = N/D$. Both algorithms use the observation that an increased amount of data can lower the precomputation time. Since the size of the internal state is $2^{2k}$, it is clear that recovering the internal state is not possible with $T < 2^k$ and $M < 2^k$ using any of the algorithms. On the other hand, recovering the key would be possible with e.g. $T = 2^{112}$, $M = 2^{112}$ and $D = 2^{56}$ but will require a precomputation time of $P = 2^{168}$. Some might argue that this would be a valid (academic) attack while some would claim that $P = 2^{168}$ is too large to be interesting when the key size is 128 bits.
Ad hoc improvements to the TMTO attacks can also be considered, where recovering a subset of bits will allow recovering other bits as well using algebraic relations in the output function. The success of these attacks is specific to the design, in particular to the output function chosen in the design. The idea, as proposed in [40, 41] and demonstrated on the Grain family of stream ciphers, is to identify a subset of state bits which, together with some output bits, can be used to determine the remaining state bits. Using this observation, the TMTO attack can be improved by only considering the subset of state bits needed for recovering the rest. The normality of the output Boolean function will here play an important role as it determines how many shift register bits need to be fixed in order to recover the remaining state bits. The normality order of this function in the design is 7, which means that 14 − 7 = 7 variables need to be fixed in order to get linear equations for the recovery. The Galois configuration of the shift register G, together with the fact that not all bits have a corresponding bit in the transformed equivalent register F, will complicate this attack.
Still, we do not rule out that some improvement over the generic TMTO attacks is possible using this approach. However, the required memory complexity of such an attack will far exceed that of brute force, and a parallelized brute force [42] is likely to be much more efficient.
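The key-recovery parameters quoted above can be checked against the second trade-off curve by working with base-2 logarithms. Here T, M, D, P and N stand for time, memory, data, precomputation and search-space size. The sketch assumes that the search space for key recovery is the key together with the IV, i.e. N = 2^(128+96); this value is inferred from the quoted numbers, and the function names are illustrative.

```python
def on_tradeoff_curve(log_t, log_m, log_d, log_n):
    """Check T * M^2 * D^2 = N^2 with 1 <= D^2 <= T, all given as log2 values."""
    on_curve = log_t + 2 * log_m + 2 * log_d == 2 * log_n
    data_ok = 0 <= 2 * log_d <= log_t
    return on_curve and data_ok


def log_precomputation(log_n, log_d):
    """log2 of the precomputation time P = N / D."""
    return log_n - log_d


LOG_N = 128 + 96          # assumption: key bits plus IV bits

valid = on_tradeoff_curve(log_t=112, log_m=112, log_d=56, log_n=LOG_N)
log_p = log_precomputation(LOG_N, 56)
```

With these inputs `valid` is true and `log_p` equals 168, in line with the stated precomputation time of $2^{168}$.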
Chosen IV attacks
The complexity of the initialization function does not affect the attack complexities in the TMTO attacks. In this section we consider attacks that do depend on the initialization function. In a chosen IV scenario, the adversary can choose the initialization vector used in the initialization step. This is the basis for the Cube attack [43] and AIDA attack [44] and can lead to key recovery if the initialization is not carefully designed. The number of iterations in the initialization should be chosen such that all key and IV bits affect the keystream bits in a complex way.
To determine the resistance against these types of attacks, maximum degree monomial tests have been performed. Any keystream bit can be written as a function of key and IV bits
All key bits are fixed to zero and a subset of the IV bits is fixed as well. Thus, running through all possible combinations of the non-fixed bits, the truth table of the function $f_i$ is obtained, which can in turn be used to compute the ANF. This will lead to a d-monomial test [45] as we could check the presence of monomials of degree d and compare it to the expected number for a random Boolean function. Intuitively, the maximum degree monomial only exists if all bits have been properly mixed by the initialization function, so we focus on this monomial. The total number of bits that can be used is 96, requiring a complexity of $2^{96}$ in order to determine the presence of the monomial $iv_0 \cdots iv_{95}$. This is not feasible, and we instead adopt the test in [46] in order to find a monomial with manageable degree that will be absent for as many initialization rounds as possible. The algorithm starts with just a few bits and exhaustively finds the monomial that is absent for the maximum number of rounds. Then it greedily adds one more bit to the set and continues. All non-used key and IV bits are set to zero. For a conservative estimate, the algorithm is also allowed to use key bits. This turns the chosen IV attack into a less powerful nonrandomness detector, since an attacker is not assumed to be able to choose key bits. Figure 4 shows the number of initialization rounds that can be broken using a particular degree (bit set size) for the monomial. By using dedicated hardware it would be possible to test a larger number of IV bits, i.e., larger degree monomials. However, from the results in Fig. 4 we deduce that the number of initialization steps is adequate to resist these types of chosen IV attacks. With 159 rounds that fail the nonrandomness test, we conclude that the proposed 256 rounds provide an adequate security margin.
Fig. 4 The maximum number of initialization rounds that do not pass a maximum degree monomial test for a given monomial degree
For comparison, this test applied to Grain-128 can find nonrandomness in about 240 initialization rounds with bit set size 23. Using a bit set size of 40, the full Grain-128 initialization using 256 rounds shows nonrandomness.
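The core step of a maximum degree monomial test can be sketched as follows. The ANF coefficient of the monomial containing all n free variables equals the XOR of the function over all $2^n$ inputs. In the actual test the function would be a keystream bit after a reduced number of initialization rounds, viewed as a function of the chosen bit set; the two toy functions below are hypothetical stand-ins, and the function names are assumptions.

```python
from itertools import product


def max_degree_coefficient(f, n):
    """ANF coefficient of x_0*x_1*...*x_{n-1}: XOR of f over all 2^n inputs."""
    acc = 0
    for x in product((0, 1), repeat=n):
        acc ^= f(x)
    return acc


def well_mixed(x):
    """Toy function whose ANF contains the maximum degree monomial x0*x1*x2."""
    return (x[0] & x[1] & x[2]) ^ x[0]


def poorly_mixed(x):
    """Toy function of degree 2: the maximum degree monomial is absent."""
    return (x[0] & x[1]) ^ x[2]


present = max_degree_coefficient(well_mixed, 3)
absent = max_degree_coefficient(poorly_mixed, 3)
```

Here `present` is 1 and `absent` is 0. A bit set for which the coefficient stays 0 over many trials indicates that the chosen bits have not yet been fully mixed after the tested number of rounds.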
Differential attacks
Differential attacks have been applied to several stream ciphers. A discussion of these attacks, together with several attacks on well-known stream ciphers, can be found in [47]. Different types of differentials can be targeted. Though it is possible, under some circumstances, to exploit differences including the internal state (see [47]), the most useful differential is $(\Delta \mathrm{Key}, \Delta \mathrm{IV}) \to (\Delta \mathrm{keystream})$. Such a differential could be used to mount a straightforward related-key or chosen-IV distinguishing attack, and the number of initializations needed will depend on the probability of the differential. To get an indication of how differentials in the key and IV propagate through the initialization function in Espresso, all differentials $(\Delta \mathrm{Key}, \Delta \mathrm{IV})$ with Hamming weight 1 have been simulated. Given a difference in $(\Delta \mathrm{Key}, \Delta \mathrm{IV})$, a difference in the first output bit is seen as Bernoulli distributed. For a uniformly random process we expect p = 0.5. For k initialization rounds, we simulate the probability p for each possible weight 1 differential and record the largest k, denoted $\hat{k}$, such that $p = 0.5 \pm 5 \cdot \sigma$, where σ is the standard deviation for a uniform process. For each differential and each k, $10^6$ samples were used. For all differentials, we get $31 \le \hat{k} \le 72$. While we only tested a small fraction of all possible differentials, this suggests that the initialization process has a large margin for the propagation of these.
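The width of the 5σ band can be made concrete with a short calculation. The sketch assumes that σ refers to the standard deviation of the observed fraction of ones over the given number of fair coin flips; the function names are illustrative.

```python
import math


def sigma_uniform(samples):
    """Standard deviation of the observed fraction of ones over fair coin flips."""
    return math.sqrt(0.25 / samples)


def within_band(p_observed, samples, width=5.0):
    """True if p_observed lies inside 0.5 +/- width * sigma."""
    return abs(p_observed - 0.5) <= width * sigma_uniform(samples)


SAMPLES = 10 ** 6
sigma = sigma_uniform(SAMPLES)                 # 0.0005 for one million samples
inside = within_band(0.5020, SAMPLES)          # deviation 0.0020, band is 0.0025
outside = within_band(0.5030, SAMPLES)         # deviation 0.0030, band is 0.0025
```

With one million samples the band is 0.5 ± 0.0025, so an observed probability of 0.502 is still consistent with a uniform process under this criterion, while 0.503 is not.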
Weak keys
Since the period of Espresso is $2^{256} - 1$, it has only one fixed point state: the all-zero state. Therefore, the probability of a weak key is $1/2^{256}$, i.e. negligible.
Conclusion
We presented a new stream cipher Espresso targeting 5G wireless communication systems. Its 1-bit per cycle version has 1497 GE area, 2.22 Gbit/s throughput and 232 ns latency, meeting the requirements of most 5G applications envisioned today. It is resistant to known attacks, including linear approximations, algebraic attacks, time-memory-data trade-off attacks, chosen IV attacks, differential attacks and weak key attacks.
Appendix: Test vectors
In producing the test vectors, the byte order is treated as taking the Least Significant Bit (LSB) first. Thus the LSB of the first byte of the key is index 0 of the state, the LSB of the second byte is index 8, etc. Similarly, for the IV the LSB of the first byte is index 128 of the state, etc. The keystream bytes KS are also filled with LSB first. The treatment of bits presented above is made for the purpose of interpreting the Appendix test vectors only. The stream cipher is bit-oriented, and if a certain application finds it more appropriate or efficient to treat the order of bits differently when assigning bytes, it will still be a valid use of the cipher. In such a case, the application has to define how it treats the bit order.
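The LSB-first convention can be sketched as a pair of conversion helpers. The function names and the example bytes are assumptions for illustration; they are not test-vector values.

```python
def bytes_to_bits_lsb_first(data):
    """Expand bytes into a bit list, least significant bit of each byte first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def bits_to_bytes_lsb_first(bits):
    """Pack a bit list (length a multiple of 8) into bytes, LSB first."""
    return bytes(
        sum(bit << i for i, bit in enumerate(bits[j:j + 8]))
        for j in range(0, len(bits), 8)
    )


# Hypothetical two-byte example: 0x01 sets bit index 0, 0x80 sets bit index 15.
example_bits = bytes_to_bits_lsb_first(bytes([0x01, 0x80]))
round_trip = bits_to_bytes_lsb_first(example_bits)
```

Under this convention the key bits produced by `bytes_to_bits_lsb_first` would occupy state indices 0 to 127, the IV bits would start at index 128, and keystream bits would be packed back into bytes with `bits_to_bytes_lsb_first`.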
Test vector 1:
key[0],
