Abstract-This paper presents two new hardware designs of the Welch-Gong (WG)-128 cipher, one for the multiple output WG (MOWG) version, and the other for the single output version WG, based on type-II optimal normal basis representation. The proposed MOWG design uses signal reuse techniques to reduce hardware cost in the MOWG transformation, and it increases the speed by eliminating the inverters from the critical path. This is accomplished by reconstructing the key and initial vector loading algorithm and the feedback polynomial of the linear feedback shift register. The proposed WG design uses properties of the trace function to optimize the hardware cost in the WG transformation. The application-specific integrated circuit and field-programmable gate array implementations of the proposed designs show that they outperform the existing implementations of the WG cipher in area and power consumption.
I. INTRODUCTION

SYNCHRONOUS stream ciphers are lightweight symmetric-key cryptosystems. These ciphers encrypt a plaintext, or decrypt a ciphertext, by XORing it bit-by-bit with the generated key-stream bits. The key-stream bits are produced using a pseudorandom sequence generator (PRSG) and a seed (secret key). Stream ciphers are heavily used in wireless communication and resource-constrained applications such as the 3GPP LTE-Advanced security suite [1], network protocols (Secure Socket Layer, Transport Layer Security, Wired Equivalent Privacy, and Wi-Fi Protected Access) [2], radio frequency identification (RFID) tags [3], and Bluetooth [4], to name a few.
Traditionally, many hardware-oriented stream ciphers have been built using linear feedback shift registers (LFSRs) and a filter/combiner Boolean function. However, the discovery of algebraic attacks made such designs insecure [5]-[8]. Many nonlinear feedback shift register-based stream ciphers have been proposed in the eSTREAM stream cipher project [9], but there are limited theoretical results about their randomness and cryptographic properties [3], and therefore, their security depends on the difficulty of analyzing the design itself [3], [10]. In addition, the arrival of the 4G mobile technology has triggered another initiative for new stream ciphers [11], [12]. The randomness of the keystreams generated by the 4G LTE cryptographic algorithms is, however, hard to analyze, and some weaknesses have been discovered [13]-[15]. The Welch-Gong (WG) (29, 11) [29 corresponds to $GF(2^{29})$ and 11 is the length of the LFSR] is a stream cipher submitted to the hardware profile in phase 2 of the eSTREAM project [9]. It has been designed based on the WG transformations [16] to produce key-stream bits with mathematically proven randomness properties. Such properties include balance, long period, ideal tuple distribution, large linear complexity, ideal two-level autocorrelation, three-valued cross correlation with an m-sequence, high nonlinearity, a Boolean function with high algebraic degree, and 1-resiliency [10], [17]-[19]. The revised version of the WG (29, 11) [9], [10] is not vulnerable to the chosen initial value (IV) attack [20], [21]. The number of key-stream bits per run is strictly less than the number of key-stream bits required to perform the attack introduced in [22]. In addition, the WG cipher is secure against algebraic attacks [10], [19].
Therefore, the WG (29, 11) is secure and has randomness properties that are not offered by other ciphers and, hence, it has the potential to be adopted in practical applications.
Despite its attractive randomness and cryptographic properties, few designs have been proposed for the hardware implementation of the WG (29, 11). Gong and Nawaz [18] adopt a direct design using computation in the optimal normal basis (ONB), which requires seven multiplications and an inversion over $GF(2^{29})$. The inversion using the Itoh-Tsujii algorithm requires $\lfloor \log_2(28) \rfloor + H(28) - 1 = 4 + 3 - 1 = 6$ multiplications and 28 squarings in $GF(2^{29})$, where $H(28)$ denotes the Hamming weight of 28 [23]. Nawaz and Gong [10] replaced the inversion operation with a computation of the power $2^k - 1$ that requires four multiplications for $k = \lceil 29/3 \rceil = 10$ and reduced the other seven multiplications of the WG transformation in [18] by one through signal reuse. Krengel [24] uses a look-up based approach that uses $2^{29}$ bits of ROM. Lam et al. [25] propose a multiple-bit output version of the WG cipher, called multiple output WG (MOWG). The MOWG reduces the hardware cost through signal reuse by removing one multiplier from the WG permutation in [10], while generating $d \le 17$ output bits. Furthermore, [25] improves the hardware cost and throughput of the cipher through pipelining with reuse techniques. The keystream sequences generated by the MOWG cipher possess many of the WG keystream randomness properties [25].
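The multiplication counts quoted above can be checked with a short sketch. It assumes the usual addition-chain count $\lfloor \log_2 e \rfloor + H(e) - 1$ for reaching $X^{2^e - 1}$, which matches the figures in the text for both the inversion ($e = 28$) and the power $2^k - 1$ ($e = k = 10$). The function name `chain_multiplications` is illustrative only.

```python
def chain_multiplications(e: int) -> int:
    """Multiplications used by an Itoh-Tsujii style chain to reach X^(2^e - 1).

    The count is floor(log2(e)) + H(e) - 1, where H is the Hamming weight.
    """
    if e < 1:
        raise ValueError("e must be a positive integer")
    floor_log2 = e.bit_length() - 1      # floor(log2(e)) for e >= 1
    hamming_weight = bin(e).count("1")   # H(e)
    return floor_log2 + hamming_weight - 1


# Inversion in GF(2^29): X^(-1) = (X^(2^28 - 1))^2, so e = 28.
inversion_mults = chain_multiplications(28)
# Power 2^k - 1 with k = 10, as used in place of the inversion.
power_mults = chain_multiplications(10)

print(inversion_mults, power_mults)
```

Under this count, replacing the inversion by the power $2^k - 1$ saves two multiplications before any signal reuse is applied.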
In this paper, a novel method for computing the trace of a product of two field elements is presented for the case where the representation is the type-II ONB. In addition, two designs are proposed, one for the MOWG cipher and the other for the WG cipher (that was initially proposed in [18]), and both are demonstrated by application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) implementations. The proposed designs optimize the area by reducing the number of multiplications in the MOWG/WG transforms. This is done through signal reuse for the MOWG and through the new trace properties for the WG. The ASIC and FPGA implementations of the proposed WG design show significant area and power consumption reductions and an improved speed compared with [10].
This paper is organized as follows. Section II defines the terms and notations, and gives a brief background on the MOWG/WG cipher. Sections III and IV present the new hardware designs of the MOWG cipher and WG cipher, respectively. Results based on FPGA and ASIC implementations of the new designs are discussed in Section V. Section VI concludes this paper.
II. PRELIMINARIES
This section defines the notations that will be used throughout this paper to describe the WG cipher and its operation. In addition, a brief introduction to the components and operation of this cipher is presented.
1) $GF(2)$, binary finite field with elements 0 and 1.
2) $GF(2^m)$, binary extension field with $2^m$ elements represented as m-bit binary vectors.
, and p is a positive integer, then, in NB:
a) $A^{2^p} = A \gg p$ represents the right cyclic shift of the coordinates of A, with respect to NB, p times.
b) $A^{2^{-p}} = A \ll p$ represents the left cyclic shift of the coordinates of A, with respect to NB, p times.
6) In NB, the addition of 1 to an element can be done by complementing the bits of that element.
7) The trace of any $GF(2^m)$
is the characteristic polynomial of an l-stage LFSR over $GF(2^m)$, from which the recurrence relation is obtained as
Fig. 1. WG generator [10], [18], [19], [25]. IV is the input during the loading phase. (Linear feedback ⊕ initial feedback) is the input during the key initialization phase. Linear feedback is the input throughout the PRSG phase.
where
is the initial state of the LFSR. The architecture of the WG cipher is shown in Fig. 1 . The LFSR feedback polynomial
is a primitive polynomial of degree 11 over $GF(2^{29})$, where $\beta = \alpha^{464730077}$ is the generator of the ONB and $\alpha$ is a root of the defining polynomial of $GF(2^{29})$ given by [10]
The output of the LFSR at $A_{i+10}$ is filtered by an orthogonal 29-bit WG transformation $GF(2^{29}) \to GF(2)$ given by
is the WG permutation,
. This results in a binary key-stream of period $2^{319} - 1$ [10], [18]. The MOWG cipher uses the same formulation presented in (5), however, without the trace. It outputs 17 concatenated bits arbitrarily selected from the 29 output bits of the WG permutation [25].
The WG/MOWG ciphers consist of three phases of operations: loading phase (11 cycles), key initialization phase (22 cycles), and running phase. The reader is referred to [10] , [18] , [19] , and [25] for more details.
III. OPTIMIZED HARDWARE DESIGN OF THE MOWG CIPHER
This section presents a hardware design of the MOWG (29, 11, 17) cipher, where 29 corresponds to $GF(2^{29})$, 11 is the number of stages in the LFSR, and 17 is the number of output bits. In this design, the MOWG transform uses seven multipliers, compared with eight multipliers in [25]. In addition, to improve the overall speed of the cipher, the LFSR is reconstructed to remove the inverters from the critical paths during the PRSG and key initialization phases. In what follows, the reduced-area MOWG transform design is first introduced, followed by the changes to the LFSR and to the key and initial vector loading algorithm (KIA) for speed improvement. Then, the architecture of the finite-state machine (FSM) is discussed, and the section ends by deriving formulations for the space and time complexities.
Fig. 2. Proposed MOWG transformation. $X = A_{i+10} \oplus 1$ is the bit-wise complement of the LFSR's output,
A. Reducing the Hardware Complexity of the MOWG Transformation
The hardware cost of the MOWG cipher is dominated by its transform's field multipliers. Any decrease in the number of these multipliers reduces the area of the overall cipher. This subsection presents the architecture of the MOWG transform, where the number of field multipliers is reduced by one through signal reuse, compared with [25].
The architecture of the proposed MOWG transform is shown in Fig. 2. This architecture is obtained by taking $X^{2^{2k}}$ as a common factor of the terms with exponents $2^{2k} + 2^k + 1$ and $2^{2k} + 2^k - 1$ in (6), so that the WG permutation given by (6) is now computed as follows:
In the MOWG (29, 11, 17), $k = 10$ and, hence, the signal $X^{2^k-1}$ requires four multiplications and four squaring operations (which are free of cost in ONB) [25]. In addition to the multiplication operations involved in computing the signal $X^{2^k-1}$, (7) requires three more multiplications to generate the signals $X^{2^k+1}$, $X^{2^{2k}}(X^{r_1} \oplus X^{2^k-1})$, and $X^{2^k(2^k-1)+1}$. Therefore, the architecture of Fig. 2 requires a total of seven $GF(2^{29})$ multiplications. The inverter symbol denoted by (1) in this figure requires 29 NOT gates to generate $X = A_{i+10} \oplus 1$ from the LFSR's output signal $A_{i+10}$. The signal $X \oplus X^{r_1} \oplus X^{r_2} \oplus X^{r_3} \oplus X^{r_4}$ is obtained by addition in $GF(2^{29})$. The signals $X^{2^k}$ and $X^{2^{2k}}$ are obtained by right cyclic shifts of X, k and 2k times, respectively. $X^{2^k+1}$ is generated by multiplying X with $X^{2^k}$. The signal $X^{2^k(2^k-1)}$ is the right cyclic shift of $X^{2^k-1}$, k times, and $X^{2^k(2^k-1)+1}$ is generated by multiplying $X^{2^k(2^k-1)}$ with X in $GF(2^{29})$. In Fig. 2, the coordinates of the output $X \oplus X^{r_1} \oplus X^{r_2} \oplus X^{r_3} \oplus X^{r_4}$ in $GF(2^{29})$ are complemented by the inverter symbol denoted by (2) to generate all 29 bits of the WGPerm function of (7), which forms the initial feedback. Seventeen bits of the WGPerm are the output of the MOWG in the run phase [25].
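The factorization described in this subsection can be written out as a worked derivation. This is a reconstruction from the surrounding prose, not a verbatim copy of the paper's numbered equations. It assumes the usual WG exponents $r_1 = 2^k + 1$, $r_2 = 2^{2k} + 2^k + 1$, $r_3 = 2^{2k} - 2^k + 1$, and $r_4 = 2^{2k} + 2^k - 1$, and a permutation of the form $1 \oplus X \oplus X^{r_1} \oplus X^{r_2} \oplus X^{r_3} \oplus X^{r_4}$.

```latex
\begin{align*}
X^{r_2} \oplus X^{r_4}
  &= X^{2^{2k}+2^k+1} \oplus X^{2^{2k}+2^k-1}
   = X^{2^{2k}} \left( X^{2^k+1} \oplus X^{2^k-1} \right)
   = X^{2^{2k}} \left( X^{r_1} \oplus X^{2^k-1} \right), \\
X^{r_3}
  &= X^{2^k(2^k-1)+1}
   = X \cdot \left( X^{2^k-1} \right)^{2^k}, \\
\mathrm{WGPerm}(X)
  &= 1 \oplus X \oplus X^{r_1}
     \oplus X^{2^{2k}} \left( X^{r_1} \oplus X^{2^k-1} \right)
     \oplus X \cdot \left( X^{2^k-1} \right)^{2^k}.
\end{align*}
```

Counting multiplications in the last line gives four for $X^{2^k-1}$, one for $X^{r_1} = X \cdot X^{2^k}$, one for the common-factor product, and one for $X^{r_3}$, which agrees with the total of seven stated in the text. Powers of the form $(\cdot)^{2^k}$ are cyclic shifts in the ONB and need no multiplier.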
B. Improving the Critical Path of the MOWG Transform
The time delay through the MOWG transform dominates the delay of the overall cipher (Section III-D2). This subsection shows how to slightly reduce the delay through this transform. This is accomplished by removing inverter 1, and by relocating inverter 2 away from the critical paths of the PRSG and key initialization phases. This reduces the delay of the critical path by an amount equivalent to the delay of two inverters. However, the MOWG transform delay is still dominant because of the delays of five serially connected field multipliers. First, the required mathematical formulation is derived, then the resulting new architecture of the cipher is presented.
1) Formulation: During the key initialization and PRSG phases, inverter 1 in Fig. 2 generates the complement of $A_{i+10}$. Notice that this cell holds the feedback from the LFSR during the PRSG phase, and the bit-wise XOR of the LFSR feedback and the MOWG transform feedback during the key initialization phase. Therefore, removing inverter 1 requires the direct storage of the complement of these values in both phases. In other words, it is required to reconstruct the LFSR such that it generates a sequence $B = \{B_i\}$ with $B_i = A_i \oplus 1 \in GF(2^{29})$, where $\{A_i\}$ is the sequence generated by (3) over $GF(2^{29})$. Sequence B is referred to as the complement sequence of $\{A_i\}$. The following proposition shows how this is accomplished for an LFSR with a general feedback polynomial of degree l over $GF(2^m)$.
Proposition 1: Let B be the complement sequence of a sequence $A = \{A_i\}$, $0 \le i \le 2^{ml} - 1$, where $A_i \in GF(2^m)$ and A is generated by (2). Then, B is generated by the following recurrence relation:
where $j \ge 0$, and the initial state of B is
Proof: By definition, $B_{l+j} = A_{l+j} \oplus 1 = \bigoplus_{i=0}^{l-1} C_i A_{i+j} \oplus 1$, and by noticing $2C_i = 0$ one obtains
Thus, the assertion is true.
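Proposition 1 can be checked numerically on a toy LFSR. The sketch below is not the cipher's LFSR: it assumes a small field $GF(2^3)$ in polynomial basis, hypothetical coefficients, and a recurrence of the form $A_{l+j} = \bigoplus_i C_i A_{i+j}$. The constant added to the complemented recurrence, $\bigoplus_i C_i \oplus 1$, is derived from $B_i = A_i \oplus 1$ and is assumed to correspond to the constant term of (8). In polynomial basis the field element 1 is the integer 1, whereas in a normal basis adding 1 complements every bit.

```python
M = 3
POLY = 0b1011  # x^3 + x + 1, irreducible over GF(2) (toy field, assumption)


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^M) elements held as polynomial-basis integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached M: reduce by the field polynomial
            a ^= POLY
    return r


def lfsr(coeffs, state, steps, const=0):
    """Generate A_{l+j} = XOR_i C_i * A_{i+j} XOR const for `steps` outputs."""
    seq = list(state)
    l = len(coeffs)
    for j in range(steps):
        nxt = const
        for i in range(l):
            nxt ^= gf_mul(coeffs[i], seq[i + j])
        seq.append(nxt)
    return seq


def complement_constant(coeffs):
    """Constant of the complemented recurrence: (XOR of all C_i) XOR 1."""
    c = 1
    for ci in coeffs:
        c ^= ci
    return c


coeffs = [3, 1, 0, 1]      # hypothetical C_0..C_3, not the WG coefficients
init_a = [5, 2, 7, 1]      # arbitrary initial state of A
seq_a = lfsr(coeffs, init_a, 50)

init_b = [a ^ 1 for a in init_a]   # initial state of B is the complement of A's
seq_b = lfsr(coeffs, init_b, 50, const=complement_constant(coeffs))

matches = all(b == a ^ 1 for a, b in zip(seq_a, seq_b))
print(matches)
```

The check relies only on distributivity, $C_i (A \oplus 1) = C_i A \oplus C_i$, so it holds for any coefficient set and any basis.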
By noticing that $X = 1 \oplus A_{i+10}$ in (7), one can see from Proposition 1 that X is $B_{i+10}$. Notice that the constant term in (8), an element of $GF(2^{29})$, is realized with a number of NOT gates equal to its Hamming weight. For the LFSR of the MOWG, replacing the coefficients of (3) in (8) gives
which has a Hamming weight equal to 28.
Inverter 2, on the other hand, realizes the addition of the field element 1 in (7). Notice that this addition of the term 1 can be implemented in different ways. One way is to add it to one of the terms $X$, $X^{r_1}$, $X^{r_2} \oplus X^{r_4}$, or $X^{r_3}$ before the summation of these terms. Doing so would relocate inverter 2 from its current position. It is, however, required that this relocation does not result in a delay higher than the current maximum delay of the MOWG transform. For this reason, the inverter is relocated to complement X before it is added to $X^{r_1}$. This is the path at the top of Fig. 2, which has the lowest delay with only two $GF(2^{29})$ adders between inverters 1 and 2.
2) Modified KIA Algorithm: Modifying the MOWG's LFSR according to (8) requires its leftmost stage to hold the complement of the IV during the loading phase. Therefore, it is required to complement the IV input before it is loaded to the modified LFSR. This can easily be implemented by inserting 29 inverters at the multiplexer's input that receives the IV in Fig. 1.
3) Architecture: Here, the overall proposed architecture of the MOWG (29, 11, 17) cipher is presented, as shown in Fig. 3. In this figure, the FSM controls the input to the LFSR for each phase of operation. Because of the bit-wise complement operator denoted by (a), the LFSR receives the complemented IV during the loading phase. Hence, after 11 clock cycles, the initial state of this LFSR, $(B_0, B_1, \ldots, B_{10})$, is the complement of the initial state of the LFSR in Fig. 1, i.e., $B_i = A_i \oplus 1$, $0 \le i < 11$. When the key initialization phase starts, the bit-wise XOR of the initial feedback and linear feedback is applied to the input of the LFSR. Note that the linear feedback in Fig. 3 is generated by (8), which is equivalent to $B_i = A_i \oplus 1$, $11 \le i < 33$ (the complement of the corresponding signal in Fig. 1). However, the initial feedback signal in Fig. 3 has the same value as the one generated in Fig. 2. This means that the input to the LFSR during the key initialization phase in Fig. 3 is complemented with respect to the one in Fig. 1. Throughout the PRSG phase, the only input to the LFSR is the linear feedback signal.
C. Finite State Machine
This subsection describes the architecture of the FSM and how it schedules the input to the LFSR throughout the three phases of operation. Fig. 4 shows the components of the FSM. The FSM has two 1-bit inputs, namely clk and reset, and two outputs denoted as op0 and op1. The reset input is pulled down before each run of the cipher. This forces the 11-bit one-hot counter to initialize to (1, 0, . . . , 0), i.e., output 0 is the only bit set to a high logic level. In addition, when the reset signal is low, the 2-bit binary counter resets its state to (0, 0). Because of the 1-bit register connected to the AND gate at the reset input of the 11-bit one-hot counter, this counter starts incrementing one clock cycle after the reset signal is pulled up. This ensures that the 11-bit one-hot counter returns to its initial state after 11 clock cycles. Then, it triggers the 2-bit binary counter to increment, which starts the key initialization phase. The output of the 2-bit binary counter controls the cipher's phase of operation. This is done by generating the op0 and op1 signals according to Table I.
The op0 and op1 signals select one of the three inputs of the multiplexer in Fig. 3 and connect it to the input of the LFSR during each phase. The loading phase takes 11 clock cycles, followed by the key initialization phase, which takes 22 clock cycles, and then the run phase. During the run phase, the clock inputs of the 11-bit one-hot counter and the 2-bit binary counter become idle.
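The phase schedule produced by the two counters can be sketched behaviorally. The model below is a simplification: it ignores the one-cycle reset delay register and assumes that binary counter value 0 means loading, values 1 and 2 mean key initialization, and value 3 means the run phase. Table I is not reproduced here, so this mapping and the function name `phase_schedule` are assumptions chosen to match the 11 and 22 cycle counts in the text.

```python
def phase_schedule(total_cycles: int) -> list:
    """Return the phase name for each clock cycle after reset is released."""
    one_hot = 0   # index of the single high bit of the 11-bit one-hot counter
    binary = 0    # value of the 2-bit binary counter
    phases = []
    for _ in range(total_cycles):
        if binary == 0:
            phases.append("load")
        elif binary < 3:
            phases.append("key_init")
        else:
            phases.append("run")
        if binary < 3:                    # both counters idle in the run phase
            one_hot = (one_hot + 1) % 11
            if one_hot == 0:              # wrap-around clocks the binary counter
                binary += 1
    return phases


phases = phase_schedule(40)
print(phases.count("load"), phases.count("key_init"), phases.count("run"))
```

Each wrap of the one-hot counter advances the binary counter once, so three wraps (33 cycles) separate the release of reset from the first run-phase cycle.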
D. Space and Time Complexities
This subsection provides the space and time complexities of the MOWG design in Fig. 3 .
1) Space Complexity:
The space complexity is evaluated in terms of the number of gates in each component to obtain the overall hardware cost. Let $N_R$, $N_A$, $N_X$, $N_O$, and $N_I$ denote the number of 1-bit registers, AND gates, XOR gates, OR gates, and inverters, respectively.
a) MOWG transform: The transform dominates the hardware complexity of the MOWG design as it consists of seven field multipliers and four $GF(2^{29})$ adders. A $GF(2^{29})$ adder requires 29 XOR gates. Also, the multiplier in [26] is used for implementation, which has 841 AND gates and 1218 XOR gates. Therefore, the total hardware cost of the transformation is as listed in Table II.
b) Linear feedback shift register: The LFSR has 11 stages of 29-bit shift registers and a feedback polynomial. The feedback polynomial is composed of one field multiplier (with a constant; a multiplication with a constant can be further optimized so that it contains few XOR gates), five $GF(2^{29})$ additions, and $H(\beta \oplus 1) = 28$ inverters. Therefore, the hardware complexity of the LFSR is as listed in Table II.
c) 4-to-1 29-bit multiplexer: The 4-to-1 29-bit multiplexer is composed of a binary tree of three 2-to-1 29-bit multiplexers and two NOTs (selectors). Each 2-to-1 29-bit multiplexer is built from 29 parallel 2-to-1 1-bit multiplexers. A 2-to-1 1-bit multiplexer consists of two AND gates and one OR gate. Therefore, the total cost of the 4-to-1 29-bit multiplexer is as listed in Table II.
d) Finite-state machine: From Fig. 4, there are three AND gates, one XOR gate, and one inverter in the FSM. The 11-bit one-hot counter is simply an 11-stage circular shift register with set/reset inputs, having the output of the last shift register fed to the input of the first one. The 2-bit binary counter is built from two JK flip-flops (FFs). The two inputs of the first FF are pulled to high logic and its output drives the two inputs of the second FF. Thus, one can find the total number of 1-bit registers as
Table II lists the number of gates in the FSM.
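The per-component gate counts follow directly from the figures quoted above. The sketch below recomputes them from the stated building blocks (841 AND and 1218 XOR gates per multiplier, 29 XOR gates per adder). Table II itself is not reproduced here, so the printed totals are derived values rather than quoted ones, and the constant multiplier of the LFSR is counted as a full multiplier, which is an upper bound. The function and key names are illustrative.

```python
M = 29            # field extension degree
MULT_AND = 841    # AND gates of one GF(2^29) multiplier [26] (29 * 29)
MULT_XOR = 1218   # XOR gates of one GF(2^29) multiplier [26]
ADDER_XOR = M     # one GF(2^29) adder is 29 XOR gates


def transform_gates(multipliers: int = 7, adders: int = 4) -> dict:
    """MOWG transform: seven field multipliers and four field adders."""
    return {
        "AND": multipliers * MULT_AND,
        "XOR": multipliers * MULT_XOR + adders * ADDER_XOR,
    }


def lfsr_gates(stages: int = 11, adders: int = 5, inverters: int = 28) -> dict:
    """LFSR: 11 stages of 29-bit registers plus the feedback logic."""
    return {
        "REG": stages * M,
        "AND": MULT_AND,                       # constant multiplier, unoptimized
        "XOR": MULT_XOR + adders * ADDER_XOR,
        "INV": inverters,
    }


def mux4_gates() -> dict:
    """4-to-1 29-bit multiplexer: three 2-to-1 29-bit stages and two NOTs."""
    one_bit_muxes = 3 * M                      # 1-bit 2-to-1 multiplexers in total
    return {"AND": 2 * one_bit_muxes, "OR": one_bit_muxes, "INV": 2}


print(transform_gates())
print(lfsr_gates())
print(mux4_gates())
```

The multiplier terms dominate every total, which is why removing a single multiplier has a visible effect on the area.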
In addition to the above-mentioned components, the MOWG cipher contains two 29-bit bit-wise complement operators (inverter symbols (a) and (b) in Fig. 3) and a $GF(2^{29})$ adder (computing the bit-wise XOR of the initial feedback signal and the linear feedback signal). Let $N_O^{MOWG}$, $N_I^{MOWG}$, $N_R^{MOWG}$, $N_A^{MOWG}$, and $N_X^{MOWG}$ denote the number of OR gates, inverters, 1-bit registers, AND gates, and XOR gates in the MOWG of Fig. 3, respectively. Therefore, by adding the corresponding number of gates in this $GF(2^{29})$ adder and in inverter symbols (a) and (b) to the number of gates in the FSM, the 4-to-1 29-bit multiplexer, the LFSR, and the MOWG transform (Table II), one obtains
2) Time Complexity: Here, the formulation for the critical path delay of the MOWG cipher (Fig. 3) is derived. There are three critical paths in the MOWG.
1) Critical path of the LFSR.
2) Critical path along the MOWG transformation during the key initialization phase.
3) Critical path along the MOWG transformation during the run phase.
The LFSR's path has one multiplication and five finite field additions. This results in a propagation delay of
where $T_A$ and $T_X$ denote the propagation delay of an AND gate and an XOR gate, respectively. The delay through a finite field multiplier is $T_A + (1 + \lceil \log_2(29) \rceil) T_X$ [26]. On the other hand, the two MOWG transform paths have five multipliers in series, which corresponds to a delay of
From (10) and (11), it is clear that the longest path of the MOWG cipher passes through its transformation. From Fig. 3, the critical path of the proposed MOWG during the run phase includes the delays of a 29-bit register, five field multipliers in series, and three $GF(2^{29})$ adders. This results in the delay
where $T_{RunPh}$ denotes the maximum propagation delay through the MOWG during the run phase. In the same figure, the critical path of the MOWG during the key initialization phase includes the delays of four $GF(2^{29})$ adders, five field multipliers, a 29-bit register, and a 4-to-1 29-bit multiplexer.
Notice that the delay through the 4-to-1 29-bit multiplexer is equivalent to the delay through two 2-to-1 1-bit multiplexers in series. This is equivalent to the sum of the delays through two AND gates, two OR gates, and two inverters. Therefore, the delay of the MOWG during the key initialization phase is
Comparing (12) and (13), it is clear that $T_{KIPh} > T_{RunPh}$.
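The path delays can be tallied from the component list in the text. The sketch below assumes that every listed component lies in series and that each $GF(2^{29})$ adder contributes one XOR delay. Equations (10) to (13) are not reproduced here, so the tallies are derived from the prose rather than quoted. The key names ($T_R$ for a register, $T_O$ for an OR gate, $T_I$ for an inverter) and the function names are illustrative.

```python
def multiplier_delay(m: int = 29) -> dict:
    """Delay of one field multiplier: T_A + (1 + ceil(log2(m))) * T_X."""
    ceil_log2 = (m - 1).bit_length()   # ceil(log2(m)) for m >= 2
    return {"T_A": 1, "T_X": 1 + ceil_log2}


def run_phase_delay(m: int = 29) -> dict:
    """Register, five multipliers in series, and three field adders."""
    mult = multiplier_delay(m)
    return {"T_R": 1, "T_A": 5 * mult["T_A"], "T_X": 5 * mult["T_X"] + 3}


def key_init_delay(m: int = 29) -> dict:
    """Register, five multipliers, four adders, and the 4-to-1 multiplexer."""
    mult = multiplier_delay(m)
    return {
        "T_R": 1,
        "T_A": 5 * mult["T_A"] + 2,    # two AND gates inside the multiplexer
        "T_X": 5 * mult["T_X"] + 4,
        "T_O": 2,                      # two OR gates inside the multiplexer
        "T_I": 2,                      # two inverters inside the multiplexer
    }


print(multiplier_delay())
print(run_phase_delay())
print(key_init_delay())
```

Every term of the run-phase tally also appears in the key-initialization tally with an equal or larger count, which is consistent with the stated conclusion that $T_{KIPh} > T_{RunPh}$.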
IV. LOW COMPLEXITY WG CIPHER
This section proposes a new design of the WG (29, 11). The proposed WG design takes Fig. 3, with a trace added to the output of the WGPerm, as the starting point for optimization. Properties of the trace function when the elements of $GF(2^m)$ are represented in the type-II ONB (which exists for m = 29 [27]) are first introduced. The proposed WG design uses these properties to minimize the hardware complexity of its transform. Note that the proposed design eliminates some signals that are necessary for the generation of the initial feedback, which is required to conduct the key initialization phase of the cipher. The missing initial feedback signal is recovered by introducing a serialized scheme to generate it. At the end of this section, the hardware and time complexities of the new implementation are provided.
A. Properties of the Trace Function for Type-II ONB
This section presents a method for computing the trace of a multiplication of two field elements when the representation is in the type-II ONB. In addition, two corollaries are deduced from the proposed method.
Fact 1 [28] :
In other words, a type-II ONB is a self-dual basis. Thus, Proposition 2 is achieved as follows.
Proposition 2: In a type-II ONB, the trace of the field multiplication of any two $GF(2^m)$ elements $A = (a_0, a_1, \ldots, a_{m-1})$ and $B = (b_0, b_1, \ldots, b_{m-1})$ is computed as the inner product of A and B as follows:
$Tr(AB) = \bigoplus_{i=0}^{m-1} a_i b_i$ (14)
Proof: The proof is completed by considering the following derivation:
where the last result is obtained using Fact 1. Proposition 2 implies that the trace of a field multiplication of two elements represented in type-II ONB is easily implemented in hardware using m AND gates and m − 1 XOR gates.
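Proposition 2 can be checked exhaustively on a small field. The sketch below is a toy model, not the $GF(2^{29})$ datapath: it assumes $GF(2^5)$ defined by $x^5 + x^2 + 1$, searches by brute force for a self-dual normal basis (the property that Fact 1 attributes to a type-II ONB), and then verifies that the trace of every product equals the inner product of the coordinate vectors. All function names are illustrative.

```python
M = 5
POLY = 0b100101  # x^5 + x^2 + 1, irreducible over GF(2) (toy field, assumption)


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^M) elements held as polynomial-basis integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def trace(a: int) -> int:
    """Tr(a) = a + a^2 + a^4 + ... + a^(2^(M-1)); the result is 0 or 1."""
    t, x = 0, a
    for _ in range(M):
        t ^= x
        x = gf_mul(x, x)
    return t


def conjugates(b: int) -> list:
    """Return [b, b^2, b^4, ..., b^(2^(M-1))]."""
    out = []
    for _ in range(M):
        out.append(b)
        b = gf_mul(b, b)
    return out


def find_self_dual_normal_basis() -> list:
    """Find beta whose conjugates satisfy Tr(beta_i * beta_j) = [i == j]."""
    for b in range(1, 1 << M):
        basis = conjugates(b)
        if all(trace(gf_mul(basis[i], basis[j])) == int(i == j)
               for i in range(M) for j in range(M)):
            return basis
    raise ValueError("no self-dual normal basis found")


def coords(a: int, basis: list) -> list:
    """Coordinates in a self-dual basis: a_i = Tr(a * beta_i)."""
    return [trace(gf_mul(a, bi)) for bi in basis]


def proposition2_holds() -> bool:
    basis = find_self_dual_normal_basis()
    for a in range(1 << M):
        ca = coords(a, basis)
        rebuilt = 0
        for ci, bi in zip(ca, basis):
            if ci:
                rebuilt ^= bi
        if rebuilt != a:                 # the coordinates must reproduce a
            return False
        for b in range(1 << M):
            cb = coords(b, basis)
            inner = sum(x & y for x, y in zip(ca, cb)) & 1
            if trace(gf_mul(a, b)) != inner:
                return False
    return True


print(proposition2_holds())
```

In hardware terms, the inner product replaces a full field multiplier followed by a trace with m AND gates and m − 1 XOR gates, as stated after the proof.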
Corollary 1: In type-II ONB, the two relations below are valid for any two elements A and B in $GF(2^m)$:
$Tr(AB) = \bigoplus_{i=0}^{m-1} a_{i-n} b_{i-n}$ (15)
and
$Tr(AB) = \bigoplus_{i=0}^{m-1} a_{i+n} b_{i+n}$ (16)
where n is a positive integer and the indices of a and b are computed modulo m. Proof: Let A and B be any two elements in $GF(2^m)$ and n an arbitrary positive integer. It is well known that
for any $X \in GF(2^m)$. Therefore, by replacing X with AB one obtains
Using Proposition 2, the proof is completed by realizing that the squaring operation $X^2$ and the square root operation $X^{2^{-1}}$ are simply the right cyclic shift and the left cyclic shift of the coordinates of X with respect to the ONB, respectively. According to Corollary 1, the trace of the field multiplication of any two elements A and B, represented in type-II ONB, does not change if an n-bit cyclic shift (left or right) is applied to both elements in the same direction.
Corollary 2: Let C be a common factor of two or more $GF(2^m)$ elements AC, BC, ...; then, the following relation holds:
Proof: Let A, B, ... be any two or more arbitrary elements from the finite field $GF(2^m)$. Then
where the last result follows from Proposition 2, and $C \in GF(2^m)$.
B. Optimizing the WG Transform's Hardware for the Run Phase
Here, it is shown how Proposition 2 and Corollaries 1 and 2 are used to further reduce the number of field multiplications in the WG transform in Fig. 3 (with trace). Before proceeding, it is important to mention that by applying (14), one can generate the trace of the field multiplication of two elements A and B directly from A and B. However, the result of the multiplication operation, i.e., C = AB, will be lost. Therefore, it is important to apply (14) only to the multiplication terms in (7) that are not used anywhere else. From Fig. 3, the two signals $X^{r_2} \oplus X^{r_4}$ and $X^{r_3}$ are used only as inputs to the trace function (after they are bit-wise XORed), whereas the signal $X^{r_1}$ is required in generating $X^{r_2} \oplus X^{r_4}$ (see Section II for the values of the $r_i$). The first two signals are generated as follows:
Therefore, applying the trace function to (19)
Using (20), the WG transformation becomes
Applying a right cyclic shift of 2k stages to X and $X^{2^k(2^k-1)}$ in the term $Tr(X \cdot X^{2^k(2^k-1)})$ of (21) does not change the value of the trace
Using (22) in (21) gives
Taking $X^{2^{2k}}$ as a common factor in (23), one obtains
Notice that by applying Corollary 2 to (24), only one multiplication operation is required to generate $X^{r_1} = X^{2^k+1}$ (excluding the generation of the signal $X^{2^k-1}$). Fig. 5 shows the resulting architecture of the WG transform in (24). This architecture uses five field multipliers, i.e., four multipliers fewer than the WG transform presented in [10]. In Fig. 5, the key stream bits are obtained by
addition of the coordinates of $1 \oplus X \oplus X^{r_1}$ with respect to the ONB. On the other hand, notice that the signals $X^{r_3}$ and $X^{r_2} \oplus X^{r_4}$ do not exist in the WG transform. This is because $Tr(X^{r_2} \oplus X^{r_3} \oplus X^{r_4})$ is generated directly from $X^{2^{2k}}$, $X^{r_1}$, $X^{2^k-1}$, and $X^{2^{3k}(2^k-1)}$ using an inner product operation, as stated in (24). The absence of the two signals $X^{r_3}$ and $X^{r_2} \oplus X^{r_4}$ results in the elimination of the initial feedback signal. The next subsection proposes a recovery method for generating the initial feedback signal, which is only used in the key initialization phase.
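The chain of steps in this subsection can be summarized as a worked derivation. This is a reconstruction from the prose, not a verbatim copy of the paper's numbered equations. It assumes the exponents $r_1 = 2^k + 1$, $r_2 = 2^{2k} + 2^k + 1$, $r_3 = 2^{2k} - 2^k + 1$, and $r_4 = 2^{2k} + 2^k - 1$. The coordinate names $u_i$ and $v_i$ are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{Tr}\left(X^{r_3}\right)
  &= \mathrm{Tr}\left(X \cdot X^{2^k(2^k-1)}\right)
   = \mathrm{Tr}\left(X^{2^{2k}} \cdot X^{2^{3k}(2^k-1)}\right)
   && \text{(Corollary 1, shift by } 2k\text{)} \\
\mathrm{Tr}\left(X^{r_2} \oplus X^{r_3} \oplus X^{r_4}\right)
  &= \mathrm{Tr}\left(X^{2^{2k}} \left( X^{r_1} \oplus X^{2^k-1}
       \oplus X^{2^{3k}(2^k-1)} \right)\right)
   && \text{(common factor } X^{2^{2k}}\text{)} \\
  &= \bigoplus_{i=0}^{m-1} u_i v_i
   && \text{(Proposition 2)} \\
\mathrm{WG}(X)
  &= \mathrm{Tr}\left(1 \oplus X \oplus X^{r_1}\right)
     \oplus \bigoplus_{i=0}^{m-1} u_i v_i
\end{align*}
```

Here $u = X^{2^{2k}}$ and $v = X^{r_1} \oplus X^{2^k-1} \oplus X^{2^{3k}(2^k-1)}$, with $u_i$ and $v_i$ their coordinates in the type-II ONB. Only $X^{2^k-1}$ (four multiplications) and $X^{r_1}$ (one multiplication) need field multipliers, which agrees with the five multipliers stated in the text.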
C. Serializing the Computation of the Initial Feedback Signal
This section presents a method for the recovery of the initial feedback signal through serialized computation. To accomplish the multiplication operations during this serial computation, the existing finite field multiplier that generates the signal $X^{r_1}$ in Fig. 5 is used. The proposed scheme generates the initial feedback signal by serially computing it over three consecutive clock cycles. Denote this complete round of the serialized initial feedback computation (three clock cycles) as an extended key initialization round. In addition, denote the single clock cycle version of this computation (as in the MOWG design) as a simple round. Therefore, with serialization, the entire key initialization phase requires 3 × 22 = 66 clock cycles instead of 22 clock cycles (that is, 22 extended rounds instead of 22 simple rounds). It is noted that this only affects the key initialization phase without increasing the number of cycles required for the run phase.
The expansion of the key initialization round from one to three clock cycles is established through the support of a new FSM control signal, namely, lfsr_clk (Fig. 6). This signal controls the clock input of the LFSR and triggers it to shift once every three clock cycles. In addition, to compute the initial feedback signal over three stages, a new hardware module denoted as the serialized key initialization module (SKIM) is introduced (Fig. 7). This module uses the available signals and the field multiplier that is used in the generation of $X^{r_1}$ in Fig. 5. It schedules the proper inputs to the field multiplier in each stage of the serial computation through multiplexers. The outputs of these multiplexers are controlled by two new signals generated by the FSM, namely, $s_0$ and $s_1$ (Fig. 6). The intermediate results, between two consecutive stages of the computation, are stored in internal 29-bit registers of the SKIM.
In the following, the FSM changes required for the support of the serialization process are first introduced. Then, the architecture and operation of the SKIM module and its integration to the WG transform in Fig. 5 are discussed.
1) Architecture and Operation of the Modified FSM:
Here, the new architecture and operation of the FSM are described. The architecture, which is shown in Fig. 6, generates the new set of control signals lfsr_clk, $s_0$, and $s_1$. These are required for the serial computation of the initial feedback signal. Before each run of the cipher, the FSM resets its 11-bit one-hot counter to (1, 0, . . . , 0) and its 2-bit binary counter to (0, 0) (where the leftmost and rightmost bits, within the brackets, denote the lowest output bit and the highest output bit of the corresponding counter, respectively). This is done by pulling down the reset inputs. When the reset signal is released, the 2-bit binary counter becomes ready. At the same time, the 11-bit one-hot counter's reset input stays pulled down for an extra clock cycle. This is due to the 1-bit register connected to the input of the AND gate that drives its reset input. This ensures that the (1, 0, . . . , 0) state of the 11-bit one-hot counter consumes a clock cycle at the beginning of the loading phase. After 11 clock cycles from the release of the reset signal, the 11-bit one-hot counter returns to the (1, 0, . . . , 0) state. At this point, it triggers the clock input of the 2-bit binary counter. The 2-bit binary counter changes its state to (1, 0), triggering the start of the key initialization phase. Then, the clk signal starts triggering the clock input of the 3-bit one-hot counter. The counting will, however, start one clock cycle later, when the output of the 1-bit register connected to the 3-bit one-hot counter's reset input pulls up. This in turn ensures that the 3-bit one-hot counter consumes one clock cycle, before incrementing its initial state of (1, 0, 0), at the start of the key initialization phase. During this phase, the first output bit of the 3-bit one-hot counter drives the clock input of the 11-bit one-hot counter. Therefore, it takes 33 clock cycles for the 11-bit one-hot counter to complete 11 counts.
Hence, it takes 33 clock cycles for the 2-bit binary counter to increment. Therefore, it requires 66 clock cycles for the 2-bit binary counter to increment twice to start the running phase. When the running phase starts, with the 2-bit binary counter's state at (1, 1) , the 11-bit and the 3-bit one-hot counters stop counting, as their clock inputs become idle.
Notice that during the key initialization phase, the lfsr_clk is driven by the first output of the 3-bit one-hot counter. Hence, the LFSR shifts once every three clock cycles. The two signals $s_0$ and $s_1$ are derived from the 3-bit one-hot counter's output according to Table III. Notice that this table is realized without any additional hardware by setting $s_0$ to be the second output and $s_1$ to be the third output of the 3-bit one-hot counter. Therefore, $(s_0, s_1)$ produces the three patterns (0, 0), (1, 0), and (0, 1) during the first, second, and third stages of an extended key initialization round, respectively. During the running phase, $(s_0, s_1)$ will generate (0, 0). The following shows how these patterns are used to accomplish the proper functionality in the key initialization phase as well as in the running phase.
2) Architecture and Operation of the SKIM: Here, the SKIM, which performs the serialized computation of the initial feedback signal over an extended key initialization round (three clock cycles), is presented. Fig. 7 is a block diagram describing the architecture of this module. During the extended key initialization round, the two signals $s_0$ and $s_1$ in Fig. 7 change values in each stage as mentioned in the previous section. These two signals control the outputs of the three multiplexers MUX 1, MUX 2, and MUX 3 according to Table IV.
In each stage of the extended key initialization round, the SKIM module computes a partial value of the initial feedback signal and stores it in Register 2 ( Fig. 7) .
During the first clock cycle, $s_0$ and $s_1$ are both at low logic levels. Hence, MUX$_1$, MUX$_2$, and MUX$_3$ generate the signals $X^{2^k}$, $X$, and $X \oplus 1$ at their outputs, respectively. The output of the multiplier becomes $X^{r_1} = X^{2^k+1}$ and that of the $GF(2^{29})$ adder is $X^{r_1} \oplus X \oplus 1$. Upon receiving a new clock signal, i.e., at the start of the second clock cycle, Register 1 and Register 2 update their states with the output of the multiplier and the output of the $GF(2^{29})$ adder, respectively. In addition, $X^{2^k-1}$ is stored in a 29-bit Register (see Fig. 8). At the same time, $s_0$ pulls up, forcing the outputs of MUX$_1$, MUX$_2$, and MUX$_3$ to become $X^{r_1} \oplus X^{2^k-1}$, $X^{2^{2k}}$, and $X^{r_1} \oplus X \oplus 1$ (the state of Register 2 when the clock signal arrived), respectively. With these settings of the multiplexers and the registers, the multiplier output changes to $X^{r_2} \oplus X^{r_4} = X^{2^{2k}}\left(X^{r_1} \oplus X^{2^k-1}\right)$ and that of the $GF(2^{29})$ adder to $X^{r_4} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$, which are the next states of Register 1 and Register 2, respectively, when the third clock signal arrives. When the third clock cycle starts, $s_0$ changes to a low logic level while $s_1$ changes to a high logic level, which forces MUX$_1$, MUX$_2$, and MUX$_3$ to generate $X^{2^k(2^k-1)}$, $X$, and $X^{r_4} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$ at their outputs, respectively. The multiplier and the $GF(2^{29})$ adder outputs become $X^{r_3} = X^{2^k(2^k-1)+1}$ and $X^{r_4} \oplus X^{r_3} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$, respectively.
At the arrival of the fourth clock signal (the beginning of a new extended key initialization round), $s_0$ and $s_1$ both change back to low logic levels, and the LFSR is clocked and latched with the result of the bit-wise XOR of the computed initial feedback signal ($X^{r_4} \oplus X^{r_3} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$) and the LFSR's linear feedback signal. At the arrival of the 67th clock signal, the LFSR will have been clocked 22 times and the running phase starts.
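The three-stage schedule above can be checked numerically with a short sketch. It assumes $k = 10$ and the exponents $r_1 = 2^k + 1$, $r_2 = 2^{2k} + 2^k + 1$, $r_3 = 2^{2k} - 2^k + 1$, and $r_4 = 2^{2k} + 2^k - 1$ implied by the products in the text. For simplicity it works in a polynomial basis with the modulus $x^{29} + x^2 + 1$, which is an assumption made for illustration and not the type-II ONB of the proposed design. The helper names (`gf_mul`, `gf_pow`, `skim_round`, `direct_feedback`) are hypothetical.

```python
M, K = 29, 10
POLY = (1 << 29) | (1 << 2) | 1  # assumed modulus x^29 + x^2 + 1

R1 = 2**K + 1
R2 = 2**(2 * K) + 2**K + 1
R3 = 2**(2 * K) - 2**K + 1
R4 = 2**(2 * K) + 2**K - 1


def gf_mul(a, b):
    """Shift-and-add multiplication modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def gf_pow(a, e):
    """Square-and-multiply exponentiation."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def skim_round(x):
    """One extended key initialization round, one multiplication per stage."""
    stored = gf_pow(x, 2**K - 1)                 # X^(2^k - 1), kept in a register
    # Stage 1: (s0, s1) = (0, 0)
    reg1 = gf_mul(gf_pow(x, 2**K), x)            # X^r1
    reg2 = reg1 ^ x ^ 1                          # X^r1 + X + 1
    # Stage 2: (s0, s1) = (1, 0)
    prod = gf_mul(reg1 ^ stored, gf_pow(x, 2**(2 * K)))  # X^r2 + X^r4
    reg2 ^= prod
    # Stage 3: (s0, s1) = (0, 1)
    prod = gf_mul(gf_pow(stored, 2**K), x)       # X^r3
    return reg2 ^ prod


def direct_feedback(x):
    """Reference value X^r4 + X^r3 + X^r2 + X^r1 + X + 1."""
    acc = x ^ 1
    for r in (R1, R2, R3, R4):
        acc ^= gf_pow(x, r)
    return acc


samples = (0, 1, 2, 0x1234567, (1 << M) - 1)
print(all(skim_round(x) == direct_feedback(x) for x in samples))
```

The agreement relies only on exponent laws and distributivity, so it does not depend on the choice of basis. In a normal basis the constant 1 would be the all-ones vector rather than the integer 1 used here.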
Throughout the run phase, both $s_0$ and $s_1$ stay at logic level 0; therefore, MUX$_1$ generates the signal $X^{2^k}$ and MUX$_2$ generates the signal $X$. With these values, the multiplier generates $X^{r_1}$ and the WG transform in Fig. 8 produces a stream bit for each cycle.
Fig. 8. Proposed WG transformation after integration with the SKIM module. The block denoted by IP generates the inner product of the two 29-bit inputs (Section II), whereas ⊕ adds the 29 bits at its input over $GF(2)$. Double-headed arrows under a component (corresponding to inserted registers) and the dotted arrow output (initial feedback) are used for pipelining (Section V-B). Numbers under a register specify the clocking of that register within the pipelined scheme during the initialization phase.
D. Space and Time Complexities
This section first presents the hardware complexity of the proposed WG implementation, followed by its time complexity.
1) Space Complexity:
The space complexity of the WG transform is reduced, whereas that of the WG's FSM is slightly increased, compared with the corresponding ones in the proposed MOWG. In what follows, the hardware complexities of the WG transform and its FSM are first summarized. Then, the overall hardware cost of the WG design is obtained.
a) WG transformation: The space complexity of the WG transform has been improved compared with the MOWG transform. This is mainly because the number of field multipliers in the WG transform is reduced by two with respect to that in the MOWG transform. On the other hand, compared with the MOWG transformation in Fig. 3, the design in Fig. 8 has the following additional components: 1) a $GF(2^{29})$ adder; 2) a 29-bit $GF(2)$ addition; 3) three 29-bit Registers; 4) an XOR gate; 5) an OR gate; 6) one 4-to-1 29-bit multiplexer; 7) two 2-to-1 29-bit multiplexers with two selector NOTs; and 8) an inner product. A 29-bit $GF(2)$ adder consists of 28 XOR gates. A 2-to-1 29-bit multiplexer consists of 29 parallel 2-to-1 1-bit multiplexers. The inner product has 29 AND gates and 28 XOR gates. Details about the hardware of the other components are listed in Section III-D1. By adding the hardware of the additional components to the gate count of the MOWG transform (Table II), and then subtracting the hardware cost of two field multipliers, the total hardware cost of the proposed WG transform is obtained as listed in Table V.
Like the 11-bit one-hot counter, the 3-bit one-hot counter is simply composed of a three-stage circular shift register with set/reset inputs, with the output of the last stage fed to the input of the first. By adding the gates in the mentioned components to the number of gates of the FSM in Fig. 4 (Table II), the total hardware cost of the FSM in Fig. 6 is obtained as shown in Table V. The LFSR and the 4-to-1 MUX of the WG have the same complexities as the ones in the MOWG (Table II).
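The inner-product block counted above can be sketched in software as follows. The code treats the two 29-bit inputs as integers and returns the parity of their bit-wise AND. The function names `inner_product` and `inner_product_gates` are hypothetical, and the gate count assumes a plain chain or tree of 2-input XOR gates.

```python
def inner_product(a: int, b: int, m: int = 29) -> int:
    """Inner product over GF(2) of two m-bit vectors packed into integers."""
    mask = (1 << m) - 1
    # m AND gates produce the partial products; their XOR is the parity.
    return bin(a & b & mask).count("1") & 1


def inner_product_gates(m: int = 29) -> dict:
    """Gate count: one AND per bit, and m - 1 two-input XORs to sum them."""
    return {"AND": m, "XOR": m - 1}


print(inner_product(0b101, 0b111))  # two overlapping ones give parity 0
print(inner_product_gates())
```

For $m = 29$ this gives 29 AND gates and 28 XOR gates, matching the count stated for the inner product.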
In addition, the WG design contains two 29-bit bit-wise complement operations [inverter symbol (a) and inverter symbol (b) in Fig. 3] and a $GF(2^{29})$ adder (computing the bit-wise XOR of the initial feedback signal and the linear feedback signal). Let
denote the number of OR gates, inverters, 1-bit Registers, AND gates, and XOR gates in the proposed WG cipher, respectively. Therefore, through adding the corresponding number of gates in the $GF(2^{29})$ adder and in inverter symbols (a) and (b) to the number of gates in the 4-to-1 multiplexer, the LFSR (see Table II), and in the FSM, and the WG transform (Table V)
2) Time Complexity:
Here, the formulation for the critical path of the proposed WG design is derived. Notice that the LFSR delay in the WG is not a candidate for the critical path, because it still has fewer multipliers contributing to its delay than the WG transform. In what follows, the formulation of the longest path during the key initialization phase is presented. After this, the path through the running phase is shown to be the longest path of the cipher.
Let $T_{\mathrm{clock}} \geq T_{\mathrm{KIPh}}$, where $T_{\mathrm{KIPh}}$ denotes the minimum clock period in the WG during the key initialization phase. During the three stages of an extended key initialization round, in order, the following three conditions hold:
where the right-hand sides of (25), (26), and (27) are simply the propagation delays during the first (generating $X^{2^k-1}$), second, and third stages of the extended key initialization round, respectively. It is clear that the right-hand side of (25) is the largest, and hence, the longest path during the key initialization phase of the WG is
The delay of the longest path through the WG during the running phase is easily obtained by adding the delays of its components as follows:
From (28) and (29), the critical path of the cipher is (29).
V. RESULTS AND COMPARISONS
The following sections compare the proposed designs of the MOWG (29, 11, 17) and the WG (29, 11) ciphers with the corresponding previous implementations in [25] , [10] , and [24] . In addition, further optimizations and general applicability of the proposed algorithms are discussed.
A. Results from FPGA and ASIC Implementations
The proposed WG and MOWG designs, together with the WG in [10], have been realized using ASIC and FPGA implementations. The ASIC speed and area results are for the 65-nm CMOS technology, based on Synopsys Design Compiler's estimates of area and clock speed before place-and-route with medium effort for optimizations. The power consumption readings have been taken at a frequency of 140 MHz for all the designs. The FPGA designs have been synthesized using the Xilinx Synthesis Tool [29]. The FPGA area and speed results are for the Xilinx Virtex4 series FPGA device xc4vfx12sf363-10. All FPGA results are post place-and-route, and the power consumption results have been recorded at a frequency of 29 MHz for all the designs. The reported ASIC and FPGA results are listed in Tables VI and VII, respectively. Furthermore, theoretical results for the WG design in [24] are listed in Table VI. The WG-7, in the same table, is another member of the WG family based on an LFSR over $GF(2^7)$. In Tables VI and VII, the readings shown for the MOWG design in [25] were reported for the pipelined-with-reuse version of the transform. The following paragraphs analyze the reported results and compare the proposed WG and MOWG designs with the previous ones in the literature.
The reported results show that the proposed WG takes longer to finish its initialization phase than the one in [10] [293 ns (ASIC)/1.94 μs (FPGA) in the proposed scheme compared with 152 ns (ASIC)/0.73 μs (FPGA) in [10]]. This is not significant because initialization is executed only once per run. The reported results also show that the proposed WG is superior to the one in [10] in terms of throughput, area, and power consumption. The proposed WG has lower latency, by 36% (ASIC) and 12% (FPGA), with respect to the one in [10]. Accordingly, the speed/throughput of the proposed WG is increased by 55% (ASIC) and 13% (FPGA), compared with [10].
In addition, notice that the normalized throughput of the proposed WG is twice that of the one in [10]. This is due to the higher throughput and the significant reduction in area (by 40% for ASIC and by 37% for FPGA) of the proposed WG compared with the one in [10]. In addition, one can see that the proposed WG consumes less power (by 39% for ASIC and 51% for FPGA) and uses less than half the energy reported for [10].
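As a rough consistency check, the reported initialization times can be related to the reported speed gains. The sketch below assumes that the proposed WG spends 66 clock cycles in key initialization (22 extended rounds of three cycles) and that the design in [10] spends 22 cycles (one per round). The latter figure is an assumption, and the function name `clock_speedup` is hypothetical. Only ratios are used, so the result does not depend on the time unit.

```python
def clock_speedup(t_init_new, t_init_ref, cycles_new=66, cycles_ref=22):
    """Ratio of clock frequencies implied by two initialization times."""
    period_new = t_init_new / cycles_new
    period_ref = t_init_ref / cycles_ref
    return period_ref / period_new


asic = clock_speedup(293, 152)     # ASIC initialization times
fpga = clock_speedup(1.94, 0.73)   # FPGA initialization times (same unit for both)

print(f"ASIC clock ratio: {asic:.2f}")
print(f"FPGA clock ratio: {fpga:.2f}")
```

Under these assumptions, the implied clock ratios are close to the reported speed increases of 55% (ASIC) and 13% (FPGA).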
The WG design in [24] requires $2^m$ ROM bits for a general WG over $GF(2^m)$. The area of the proposed WG is dominated by its field multipliers, which have space complexity quadratic in $m$. Specifically, for the WG(29, 11), $2^{29}$ bits of ROM are required in [24] (in addition to 9000 XORs and 319 registers). There are no results in [24] about the running speed of the presented WG. According to a similar study on ROM-based and multiplier-based MOWG designs in [25], ROM-based ASIC implementations are always larger and slower than those using field multipliers, for $m > 11$.
The proposed MOWG design is expected to offer better area and speed than the one presented in [25]. The proposed MOWG has eight multipliers compared with nine in [25]. Therefore, its area is expected to be scaled down by a ratio close to 8/9 with respect to the one in [25]. It is noted that the results from [25] are reported for the pipelined-with-reuse version of the transform. Applying pipelined-with-reuse techniques to the proposed MOWG would result in speed and area readings similar to the ones reported in [25]. For the nonpipelined and the pipelined (without reuse) versions, however, the proposed MOWG is expected to show lower area, slightly higher speed/throughput, and lower latency, compared with the corresponding versions from [25]. This is due to the removed multiplier and the inverters removed from its critical path (Fig. 3). Notice that a six-stage pipeline of the proposed MOWG offers six times the throughput that is reported for its nonpipelined version in Tables VI and VII (Section V-B). That is almost double the throughput provided by the pipelined-with-reuse MOWG in [25].
The proposed WG offers higher clock speed, and better area and power consumption, compared with the proposed MOWG. The proposed MOWG has, however, higher throughput and better energy per bit. Most importantly, the WG has better randomness properties than the MOWG cipher [10], [25]. Therefore, when security and randomness are critical for the application, the proposed WG design is preferred. If, instead, throughput or area is the critical criterion for the application, then the proposed WG design is superior for low-area applications, whereas the proposed MOWG serves better for high-speed applications. It is noted that one can apply serialization or pipelining to the WG/MOWG transforms to achieve lower area or higher throughput, if the application demands it. This is discussed in the next section.
B. Discussion
This section discusses the serialization and pipelining techniques as further optimizations to the proposed WG and MOWG. In addition, the applicability of the proposed techniques to general MOWG/WG ciphers in the NB is considered.
For low-throughput applications, a smaller area can be achieved by serial computation of the MOWG/WG transforms. Fig. 9 shows how this is done using one multiplier. In this figure, the dotted square is used only for generating the WG stream bits; the rest of the diagram is common to the MOWG and the WG. The initialization round takes eight cycles for both transforms. During the run phase of the MOWG, 17 output bits are generated every seven cycles. For the WG, a stream bit is produced every six cycles. The maximum propagation delay is equivalent to 17 levels of gate delays. Compared with 38 levels in (29) (WG) and 46 levels in (13) (MOWG), the clocks of the serial WG and MOWG are 2.2 and 2.7 times faster, respectively. Therefore, the throughputs of the serial versions of the WG and the MOWG ciphers are almost 2/6 and 3/7 of those of the corresponding original designs in Figs. 8 and 3, respectively. The total gate counts for the serial versions of the transforms are 4155 (WG) and 4011 (MOWG). Compared with 11053 gates in the WG transform (Section IV-D1) and 14529 gates in the MOWG transform (Section III-D1), the areas of the serial versions of the WG and MOWG transforms are almost 2/5 and 2/7 of those of their original architectures, respectively. If an even lower area is demanded, a digit-level field multiplier [30], [31] can be deployed, at the cost of more cycles for each multiplication.
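The ratios quoted for the serial versions can be reproduced with a few lines of arithmetic. The sketch assumes that every gate level contributes the same delay and that the original transforms deliver their output (one bit for the WG, 17 bits for the MOWG) every clock cycle. The names `SERIAL_LEVELS`, `DESIGNS`, and `serial_ratios` are hypothetical.

```python
SERIAL_LEVELS = 17  # gate levels on the serial critical path

DESIGNS = {
    "WG":   {"levels": 38, "cycles_per_output": 6, "gates": 11053, "serial_gates": 4155},
    "MOWG": {"levels": 46, "cycles_per_output": 7, "gates": 14529, "serial_gates": 4011},
}


def serial_ratios(design):
    """Clock, throughput, and area of the serial version relative to the original."""
    clock = design["levels"] / SERIAL_LEVELS
    return {
        "clock": clock,
        "throughput": clock / design["cycles_per_output"],
        "area": design["serial_gates"] / design["gates"],
    }


ratios = {name: serial_ratios(d) for name, d in DESIGNS.items()}
for name, r in ratios.items():
    print(name, {key: round(value, 3) for key, value in r.items()})
```

With these inputs the clock ratios round to 2.2 and 2.7, and the throughput and area ratios land near the fractions given in the text.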
The proposed schemes can achieve higher throughput through pipelined transforms. The LFSR should be reconstructed using the Galois-style feedback, or simply by placing the multiplication with $\beta$ in between cells $B_{i+1}$ and $B_i$. Otherwise, the LFSR's speed will constrain the pipelining. Fig. 3 shows how to achieve a six-stage pipeline of the MOWG transform using 19 29-bit registers. The pipelined MOWG critical path has seven levels of logic gate delays. The corresponding throughput and run-phase latency are $17/(T_A + 6T_X)$ and $6(T_A + 6T_X)$, respectively. Because (13) has 46 levels of logic gate delays, the throughput of the pipelined MOWG is almost six times higher. Similarly, Fig. 8 shows a six-stage pipeline of the WG transform. From this figure, one can find the pipelined WG's latency and throughput as $6(T_A + 6T_X)$ and $1/(T_A + 6T_X)$, respectively [the latency during initialization is higher, i.e., $8(T_A + 6T_X)$]. Compared with the throughput that results from (29), this is almost five times higher. For even higher throughput, the unfolding technique presented in [32] can be deployed. Simply, the MOWG/WG LFSR is unfolded to generate $n$ outputs ($2 \leq n \leq 11$) per cycle. Hence, by implementing the same number of transforms, the throughput will be $n$ times higher at the expense of a proportional area increase.
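The pipelining gains quoted above follow from the ratio of gate levels. As a rough worked estimate, assuming that every gate level contributes the same delay (so that $T_A + 6T_X$ counts as seven levels):

```latex
\frac{\text{throughput}_{\text{pipelined MOWG}}}{\text{throughput}_{\text{MOWG}}}
  \approx \frac{46}{7} \approx 6.6,
\qquad
\frac{\text{throughput}_{\text{pipelined WG}}}{\text{throughput}_{\text{WG}}}
  \approx \frac{38}{7} \approx 5.4 .
```

Under the same assumption, unfolding the LFSR by a factor $n$ multiplies either figure by $n$.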
Notice that (7) is a general form of the WG permutation [for any MOWG$(m, l, d)$]. Because squarings are cyclic shifts in the NB, only the architecture of the power $2^k - 1$ will vary for different values of $k = \lceil m/3 \rceil$. Given the WGPerm, the MOWG transform is just a proper selection of $d$ bits from the WGPerm [25]. In addition, notice that the complement LFSR in (8) is general for any $GF(2^m)$. Similarly, except for the power $2^k - 1$, (24) is general for any WG$(m, l)$. However, (14) is only applicable to $GF(2^m)$ where a self-dual NB exists. Therefore, if there is no self-dual NB [33], the inner product that is used to compute $\mathrm{Tr}\left(X^{2^{2k}}\left(X^{r_1} \oplus X^{2^k-1} \oplus X^{2^{3k}(2^k-1)}\right)\right)$ in Figs. 5 and 8 should be replaced with a field multiplication followed by a trace.
It is interesting to investigate the WG implementation in the PB. It is known that the PB offers area-efficient multipliers, compared with the NB representation. There is, however, a penalty because of the additional space and propagation delay introduced by the squaring operations.
VI. CONCLUSION
Two new designs for the MOWG (29, 11, 17) and the WG(29, 11) ciphers have been proposed. As compared with the MOWG presented in [25], the proposed MOWG reduces the number of field multipliers in the transform by one through signal reuse. In addition, it increases the speed by eliminating the delay of two inverters from the critical path. This is accomplished by reconstructing the KIA and the feedback polynomial of the LFSR. The proposed WG is an optimization of the proposed MOWG with trace (WG version). It is obtained by using the new properties of the trace function for type-II ONB, accompanied by serialized computation of the initial feedback signal during the key initialization phase.
The proposed designs have been implemented on ASIC and FPGA. The ASIC implementations show that the proposed WG implementation achieves better results compared with [10] for area, speed, and power consumption. The WG improves the power consumption by a 39% reduction, area by a 40% reduction, and speed by an increase of 55%. Similarly, the FPGA implementations show that the proposed WG achieves better results for area, speed, and power consumption compared with [10] . The power consumption is reduced by 51%, the area is reduced by 37%, and the speed is increased by 13%.
Based on these results, the proposed implementations of the MOWG (29, 11, 17) cipher and the WG (29, 11) cipher are promising candidates for high speed and limited resources platforms, respectively, where throughput, area, and power consumption are of critical importance and the guaranteed randomness properties are required.
Hayssam El-Razouk received the B.Eng. degree in electrical and computer
