Abstract: A common approach to protecting confidential information is to use a stream cipher, which combines plaintext bits with a pseudo-random bit sequence. Among the existing stream ciphers, Non-Linear Feedback Shift Register (NLFSR)-based ones provide the best trade-off between cryptographic security and hardware efficiency. In this paper, we show how to further improve the hardware efficiency of the Grain stream cipher. By transforming the NLFSR of Grain from its original Fibonacci configuration to the Galois configuration and by introducing a clock division block, we double the throughput of the 80-bit and 128-bit key 1 bit/cycle architectures of Grain with no area penalty.
However, unlike the LFSR case, in which the mapping from the Fibonacci configuration to the Galois configuration is one-to-one, in the NLFSR case multiple Galois NLFSRs can be equivalent to a given Fibonacci one [9]. The problem of selecting a "best" Galois NLFSR for a given Fibonacci one is still open. One of the contributions of this paper is finding the maximal-throughput Galois configurations of NLFSRs for Grain-80 and Grain-128 [10]. Another contribution is the introduction of the clock division block, which divides the clock frequency of Grain by two or four during the initialization phase. Without such a block, the potential benefits of the Galois configuration cannot be utilized.
II. BACKGROUND

A. Definition of NLFSRs
A Non-Linear Feedback Shift Register (NLFSR) consists of $n$ binary storage elements, called bits. Each bit $i \in \{0, 1, \ldots, n-1\}$ has an associated state variable $x_i$, which represents the current value of bit $i$, and a feedback function $f_i : \{0, 1\}^n \to \{0, 1\}$, which determines how the value of bit $i$ is updated.
A state of an NLFSR is an ordered set of values of its state variables. At every clock cycle, the next state is determined from the current state by updating the values of all bits simultaneously to the values of the corresponding $f_i$'s. The output of an NLFSR is the value of its 0th bit.
If for all $i \in \{0, 1, \ldots, n-2\}$ the feedback functions are of type $f_i = x_{i+1}$, we say that the NLFSR is of the Fibonacci type. Otherwise, we say that it is of the Galois type.
Two NLFSRs are equivalent if their sets of output sequences are equal.
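The definitions above can be illustrated with a small simulation. The following Python sketch is illustrative only. It assumes that each feedback function is stored in algebraic normal form as a list of product-terms (tuples of bit indices); the helper names and the two 4-bit toy registers are hypothetical and are not taken from Grain. Equivalence is checked by brute force on finite output prefixes over all initial states.

```python
from itertools import product

def eval_anf(terms, state):
    """XOR of product-terms; each term is a tuple of bit indices ANDed together."""
    value = 0
    for term in terms:
        value ^= int(all(state[k] for k in term))
    return value

def step(funcs, state):
    """Update all bits simultaneously: bit i receives f_i(current state)."""
    return tuple(eval_anf(f, state) for f in funcs)

def is_fibonacci(funcs):
    """True if f_i = x_{i+1} for every i in {0, ..., n-2}."""
    return all(funcs[i] == [(i + 1,)] for i in range(len(funcs) - 1))

def output_sequences(funcs, length):
    """Set of output prefixes (values of bit 0) over all 2^n initial states."""
    seqs = set()
    for state in product((0, 1), repeat=len(funcs)):
        out = []
        for _ in range(length):
            out.append(state[0])
            state = step(funcs, state)
        seqs.add(tuple(out))
    return seqs

# Hypothetical 4-bit Fibonacci NLFSR: f_3 = x_0 XOR x_1 XOR x_1*x_2
fib = [[(1,)], [(2,)], [(3,)], [(0,), (1,), (1, 2)]]
# Hypothetical 4-bit Galois NLFSR: f_2 = x_3 XOR x_0*x_1, f_3 = x_0 XOR x_1
gal = [[(1,)], [(2,)], [(3,), (0, 1)], [(0,), (1,)]]

# Compare the sets of 16-bit output prefixes of the two registers.
equivalent = output_sequences(fib, 16) == output_sequences(gal, 16)
```

For these two toy registers the comparison holds: both generate outputs satisfying the same recurrence $s(t+4) = s(t) \oplus s(t+1) \oplus s(t+1)s(t+2)$.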
B. The Transformation from the Fibonacci to the Galois Configuration
Let $f_i$ and $f_j$ be feedback functions of bits $i$ and $j$ of an $n$-bit NLFSR, respectively. The operation shifting, denoted by $f_i \xrightarrow{P} f_j$, moves a set of product-terms $P$ from $f_i$ to $f_j$. Each variable $x_k$ of each product-term in $P$ is replaced by $x_{(k-i+j) \bmod n}$.
The terminal bit $\tau$ of an $n$-bit NLFSR is the bit with the maximal index which satisfies the following condition: for all bits $i$ such that $i < \tau$, $f_i$ is of type $f_i = x_{i+1}$.
Definition 1: An n-bit NLFSR is uniform if the following two conditions hold:
(a) all its feedback functions are singular functions of type
$$f_i(x_0, \ldots, x_{n-1}) = x_{(i+1) \bmod n} \oplus g_i(x_0, \ldots, x_{n-1}),$$
where $g_i$ does not depend on $x_{(i+1) \bmod n}$,
(b) for all its bits $i$ such that $i > \tau$, the index of every variable of $g_i$ is not larger than $\tau$.
Theorem 1: [9] Given a uniform NLFSR with the terminal bit $\tau$, a shifting $g_\tau \xrightarrow{P} g_{\tau'}$, $\tau' < \tau$, results in an equivalent NLFSR if the transformed NLFSR is uniform as well.
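The shifting operation, the terminal bit, and the uniformity check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of [9]: feedback functions are lists of product-terms (tuples of bit indices), the singular form is assumed to be $f_i = x_{(i+1) \bmod n} \oplus g_i$, condition (b) is checked exactly as worded in Definition 1, and the 4-bit register `fib` and all function names are hypothetical.

```python
def terminal_bit(funcs):
    """Largest tau such that f_i = x_{i+1} for all bits i < tau."""
    tau = 0
    while tau < len(funcs) - 1 and funcs[tau] == [(tau + 1,)]:
        tau += 1
    return tau

def is_uniform(funcs):
    """Check conditions (a) and (b) of Definition 1 as worded in the text."""
    n, tau = len(funcs), terminal_bit(funcs)
    for i, f in enumerate(funcs):
        nxt = (i + 1) % n
        if f.count((nxt,)) != 1:           # (a) f_i = x_{(i+1) mod n} XOR g_i
            return False
        g = [t for t in f if t != (nxt,)]
        if any(nxt in t for t in g):       # (a) g_i independent of x_{(i+1) mod n}
            return False
        if i > tau and any(k > tau for t in g for k in t):
            return False                   # (b) variables of g_i bounded by tau
    return True

def shift(funcs, i, j, terms):
    """Move product-terms from f_i to f_j, mapping x_k to x_{(k-i+j) mod n}."""
    n = len(funcs)
    new = [list(f) for f in funcs]
    for term in terms:
        new[i].remove(term)
        new[j].append(tuple(sorted((k - i + j) % n for k in term)))
    return new

# Hypothetical 4-bit Fibonacci NLFSR: f_3 = x_0 XOR x_1 XOR x_1*x_2
fib = [[(1,)], [(2,)], [(3,)], [(0,), (1,), (1, 2)]]

# Shift the product-term x_1*x_2 from bit 3 to bit 2; it becomes x_0*x_1.
shifted = shift(fib, 3, 2, [(1, 2)])
```

In this toy case the terminal bit moves from 3 to 2 and the shifted register still passes the uniformity check, which is the situation covered by Theorem 1.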
III. THE DESCRIPTION OF GRAIN
There are two versions of Grain: one with an 80-bit key [4] and one with a 128-bit key [10]. Both consist of an LFSR, an NLFSR, and two combining functions.
In Grain-80 the shift registers are 80 bits long. They are both of the Fibonacci type, i.e. each bit $i$ except the 79th takes the value of bit $i+1$. The feedback function of the 79th bit of the LFSR is given by:
where $s_i$ is the state variable of the $i$th bit, $i \in \{0, 1, \ldots, 79\}$.
The feedback function of the 79th bit of the NLFSR is given by:
where $b_i$ is the state variable of the $i$th bit, $i \in \{0, 1, \ldots, 79\}$.
The first combining function of Grain-80, H, produces its output value from selected bits of the NLFSR and the LFSR:
The second combining function of Grain-80, Z, generates the output stream of the system from selected bits of the NLFSR and LFSR states and the output of H:
where A = {1, 2, 4, 10, 31, 43, 56}.
For Grain-128, the corresponding functions are:
where A = {2, 15, 36, 45, 64, 73, 89}.
Before generating a stream of data, a cipher must be initialized with default keys. During the initialization phase the cipher does not produce any output for 160 clock cycles for Grain-80 and 256 clock cycles for Grain-128. The output of the Z function is XOR-ed with the outputs of the LFSR and the NLFSR and then fed into the inputs of both shift registers, as shown in Figure 1. After the initialization, the loops are opened and there is no feedback between the two shift registers.
It is possible to increase the throughput of Grain at the expense of extra hardware by introducing parallelism in its architecture. In parallelized versions of Grain, in each clock cycle blocks of duplicated NLFSR and LFSR feedback functions produce output bits in parallel. To allow for up to 16 (32) degrees of parallelization, Grain-80 (128) is designed so that the bits 65 < i < 79 (97 < i < 127) of the shift registers are not used in the feedback functions or in the input to the combining functions.
IV. GRAIN WITH GALOIS CONFIGURATION
Grain can be modified by transforming its LFSR and NLFSR from their original Fibonacci configurations to the Galois configurations. The transformation of LFSRs is done using standard techniques; in this section we describe only the transformation of NLFSRs.
The NLFSR of Grain-80 (128) can be transformed to the Galois configuration by shifting the product-terms of the feedback function of the 79th (127th) bit to the feedback functions of bits with lower indexes. By Theorem 1, if the NLFSR after shifting satisfies the conditions of Definition 1, then it produces the same set of output sequences as the NLFSR before shifting.
Ideally, in order to maximize the throughput, we want to distribute the product-terms equally among the feedback functions. However, according to [9], to guarantee the equivalence of NLFSRs before and after shifting, we cannot shift to bits with indexes lower than the bit $\tau$, which is given by:
$$\tau = \max_{p \in P} \bigl(\mathrm{max\_index}(p) - \mathrm{min\_index}(p)\bigr),$$
where $P$ is the set of all product-terms of the feedback function of the Fibonacci NLFSR, and $\mathrm{min\_index}(p)$ ($\mathrm{max\_index}(p)$) denotes the minimal (maximal) index of the variables of the product-term $p$.
For Grain-80, the product-term with the maximal difference in indexes of variables is $b_{63} b_{45} b_{28} b_{17} b_{9}$, so $\tau = 54$. For Grain-128, we have $\tau = 64$ due to the product-term $b_{3} b_{67}$.
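The bound on $\tau$ can be checked numerically. The sketch below assumes that the bound is the largest spread between the maximal and minimal variable index over all product-terms, which matches the two values quoted above. Only the two widest product-terms named in the text are used; the extra narrow term and the function names are hypothetical.

```python
def index_spread(term):
    """Difference between the maximal and minimal variable index of a term."""
    return max(term) - min(term)

def tau_bound(terms):
    """Lowest bit index to which product-terms may be shifted (assumed rule)."""
    return max(index_spread(t) for t in terms)

# Widest product-terms as quoted in the text (variable indexes only).
grain80_widest = (63, 45, 28, 17, 9)
grain128_widest = (3, 67)

# A hypothetical narrower term, added only to show that the maximum is taken.
narrow_term = (10, 20)

tau_80 = tau_bound([narrow_term, grain80_widest])
tau_128 = tau_bound([narrow_term, grain128_widest])
```

With these inputs, `tau_80` evaluates to 54 and `tau_128` to 64.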
However, in order to avoid modifications of the encryption algorithm of Grain, we need to guarantee not only the equivalence of the sequences of output bits, but also the equivalence of the sequences of all internal bits of the NLFSR used by the combining functions. A modification of the encryption algorithm could lead to undesirable changes in the security of Grain. For Grain-80, bit 63 of the NLFSR is used in the function H, and bits 1, 2, 4, 10, 31, 43, 56 are used in the function Z. Since 56 and 63 are greater than 54, we cannot use $\tau = 54$ as the terminal bit of the Galois configuration. We need to set the terminal bit to 63. Then, for all bits $i \in \{0, 1, \ldots, 62\}$, the feedback functions will be of type $g_i = b_{i+1}$, and the output sequences of the bits $i \in \{1, 2, \ldots, 63\}$ will be the same as the output sequence of bit 0 shifted in time. Consequently, the algorithm of Grain will not change.
For Grain-128, the bits 12 and 95 of the NLFSR are used in H and the bits 2, 15, 36, 45, 64, 73, 89 are used in Z. Therefore, the terminal bit has to be 95.
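The choice of the terminal bit described above amounts to taking the largest of the bound $\tau$ and the NLFSR tap positions read by H and Z. A minimal sketch follows, using only the tap positions listed in the text; the function name `choose_terminal_bit` is hypothetical.

```python
def choose_terminal_bit(tau_bound, h_taps, z_taps):
    """Smallest terminal bit that keeps every tapped NLFSR bit a pure shift."""
    return max([tau_bound, *h_taps, *z_taps])

# Grain-80: tau = 54, bit 63 feeds H, the set A feeds Z.
terminal_80 = choose_terminal_bit(54, [63], [1, 2, 4, 10, 31, 43, 56])

# Grain-128: tau = 64, bits 12 and 95 feed H, the set A feeds Z.
terminal_128 = choose_terminal_bit(64, [12, 95], [2, 15, 36, 45, 64, 73, 89])
```

This reproduces the terminal bits 63 and 95 selected for Grain-80 and Grain-128, respectively.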
After we have chosen the position of the terminal bit, we can start shifting product-terms from the function $g_{79}$ ($g_{127}$) to the functions with indexes greater than or equal to that of the terminal bit. Shifting can be done in many different ways. At present there is no systematic technique which guarantees that the transformation produces an NLFSR with the maximal throughput for a given technology. We found the solutions presented below by trying many different choices. Here and further in this section, all omitted feedback functions are of type $g_i = b_{i+1}$.
A. One Bit per Cycle Version
For Grain-128, the maximal-throughput Galois NLFSR is:
B. Multiple Bit per Cycle Version
We can extend the theory presented in [9], [11] to $k$ bits/cycle versions of Grain by restricting the bit positions to which the feedback can be applied. It is easy to see that, to allow $k$-fold parallelization of an $n$-bit Galois NLFSR with the terminal bit $\tau$, all bits except
should have feedback functions of type $g_i = b_{i+1}$. So, for example, for the 4 bits/cycle version of Grain-80, we can apply feedback to the bits 79, 75, 71 and 67:
For the 8 bits/cycle version of Grain-128, we can apply feedback to the bits 127, 119, 111 and 103:
For the 16 bits/cycle version of Grain-128, we can apply feedback to the bits 127 and 111.
For the 32 bits/cycle version of Grain-128, we can apply feedback only to the bit 127, i.e. no transformations can be done.
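The feedback positions listed above follow a regular pattern that can be reproduced programmatically. The rule used below (step down from bit $n-1$ in strides of $k$ while staying strictly above the terminal bit) is an assumption chosen because it matches the four examples in the text; it is not a restatement of the paper's formula, and the function name is hypothetical.

```python
def feedback_positions(n, k, terminal_bit):
    """Bits that may carry non-trivial feedback in a k bits/cycle version.

    Assumed rule: n-1, n-1-k, n-1-2k, ... while the index exceeds terminal_bit.
    """
    return list(range(n - 1, terminal_bit, -k))

# Grain-80 (terminal bit 63), 4 bits/cycle.
grain80_x4 = feedback_positions(80, 4, 63)

# Grain-128 (terminal bit 95), 8, 16 and 32 bits/cycle.
grain128_x8 = feedback_positions(128, 8, 95)
grain128_x16 = feedback_positions(128, 16, 95)
grain128_x32 = feedback_positions(128, 32, 95)
```

The sketch has only been checked against the $k \ge 4$ cases given here; the 1 bit/cycle case, where shifting to the terminal bit itself is allowed, is not covered by this rule.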
C. Design Details
By transforming Grain's shift registers to the Galois configuration as described above, we can obtain up to a 40% reduction in their propagation time. However, the clock frequency of the overall Grain system improves by only about 10%. The problem lies in the hardware architecture of Grain during key initialization, when the output value of Z(x) is fed back to the LFSR and the NLFSR, making two loops, as shown in Figure 1. After the transformation from the Fibonacci to the Galois configuration, due to the reduction of the critical path in the NLFSR, the critical path of the system is no longer in the NLFSR but in the initialization loops, which are closed only during initialization. Thus, the highest frequency that the system supports during initialization is lower than the highest frequency supported during key stream generation (see Table I).

To obtain a higher improvement in the throughput of Grain, we introduce a clock division block which divides the frequency of the clock during the initialization phase. The clock divider is realized as a simple block containing one or two D flip-flops, which divides the clock frequency of the system by 2 or 4. In Figure 2 we show the structure of the clock division block for division of the clock frequency by 4. In some versions of Grain, division by 2 is sufficient to ensure correct operation during the initialization phase. Division by 3 would be suitable in some cases, but it would overcomplicate the hardware for only a modest speedup of the initialization phase.

The clock division block is a very small component. Clock division by four gives an area overhead of 25.67 GE and negligible power consumption. Grain always moves from the slower to the faster clock frequency, and the run signal is set internally by the counter on the positive edge of the clock.
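The behavior of a divider built from one or two D flip-flops can be modeled in software. The sketch below is a behavioral illustration only and does not claim to reproduce the circuit of Figure 2. It assumes that each stage is a D flip-flop whose inverted output is fed back to its input, so that it toggles on every rising edge of its clock, and that the stages are chained; the waveform representation and all function names are hypothetical.

```python
def toggle_stage(wave):
    """One D flip-flop with D = not Q: the output toggles on each rising edge."""
    q, prev, out = 0, 0, []
    for level in wave:
        if prev == 0 and level == 1:   # rising edge of the stage's clock input
            q ^= 1
        out.append(q)
        prev = level
    return out

def divide_clock(wave, stages):
    """Chain `stages` toggle stages: 1 stage divides by 2, 2 stages by 4."""
    for _ in range(stages):
        wave = toggle_stage(wave)
    return wave

def rising_edges(wave):
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    return sum(1 for a, b in zip([0] + wave[:-1], wave) if (a, b) == (0, 1))

# Eight periods of the fast clock, sampled twice per period.
clk = [0, 1] * 8
div2 = divide_clock(clk, 1)
div4 = divide_clock(clk, 2)
```

Over the eight input periods, the divide-by-2 output shows four rising edges and the divide-by-4 output shows two, i.e. the initialization loops would be given two or four fast-clock periods to settle.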
Because of the delay in the production of the run signal, the first clock cycle of the key generation phase will be shortened, which could potentially lead to critical path violations in a performance-optimized design such as Grain. We can handle this problem by using a flip-flop in front of the run signal which is output by the counter. In this case, if the run signal rises to 1 after a positive edge of the faster clock signal, the clock of the system changes to the faster clock at the next positive edge of the system clock. This solution is shown in Figure 4.

We have synthesized the Fibonacci and the Galois versions of Grain using Cadence RTL Compiler in the TSMC 90 nm standard cell technology library. Since the synthesis tool does not handle multiple clocking, we set the two initialization loops as false paths and optimized the designs for the key-generation phase. Table II shows the results for throughput, power consumption, area, and frequency. Area is measured in terms of NAND2 Gate Equivalents (GE). The total power consumption of the system is estimated as a combination of dynamic and leakage power for operation at 25 °C, with a power supply of 1.2 V at a 10 MHz clock frequency, as in [2].
As we can see, the throughput of the Galois 1 bit/cycle Grain-80 and Grain-128 is more than doubled compared to the Fibonacci versions.
Trivium is the highest-ranked finalist in the eSTREAM project. In Table III, we compare the frequency and area of Trivium (T) and Grain-80 with the Galois configuration (G). Both ciphers were implemented in TSMC 90 nm technology. Due to the Galois configuration, Grain-80 (1 bit/cycle) is faster and smaller than Trivium (1 bit/cycle), with a significantly better throughput/area ratio. This is an important result for applications such as RFID systems, which require efficiency in both throughput and area. The throughput/area figures are compared graphically in Figure 5, where the figures for the Fibonacci configuration (Grain(F)) of Grain-80 are also reported.
