The Welch-Gong (WG) stream cipher family was designed based on the WG transformation and is able to generate keystreams with mathematically proven randomness properties such as long period, balance, ideal tuple distribution, ideal two-level autocorrelation, and high and exact linear complexity. In this paper, we present a compact hardware architecture and its pipelined implementation of the stream cipher WG-16, an efficient instance of the WG stream cipher family, using composite field arithmetic and a newly proposed property of the trace function in tower field representation. Instead of using the original binary field $\mathbb{F}_{2^{16}}$, we demonstrate that its isomorphic tower field $\mathbb{F}_{(((2^2)^2)^2)^2}$ can lead to a more efficient hardware implementation. Efficient conversion matrices connecting the binary field $\mathbb{F}_{2^{16}}$ and the tower field $\mathbb{F}_{(((2^2)^2)^2)^2}$ are also derived. Our implementation results show that the pipelined WG-16 hardware core can achieve a throughput of 124 Mbit/s at the cost of 478 slices in an FPGA and 552 Mbit/s at the cost of 12,031 GEs in a 65 nm ASIC, respectively.
INTRODUCTION
With the advent of ubiquitous computing, communication security has moved more and more to the forefront of attention. Security is mandatory to ensure that the communication system is properly functioning and to prevent misuse. Stream ciphers are fast cryptographic primitives that provide confidentiality of electronically transmitted data. In general, when compared to other cryptographic primitives, stream ciphers are competitive in software applications with high throughput, and in hardware applications with small footprint. Their major applications include, but are by no means restricted to, 4G telecommunication systems [3, 4], IEEE 802.11 wireless networks [1], Bluetooth [2], digital video broadcasting systems such as pay-TV, and RFID tags [21, 23].
The WG ciphers [19] refer to a family of synchronous and hardware-oriented stream ciphers built from Linear Feedback Shift Registers (LFSRs) and Boolean functions with compact Algebraic Normal Forms (ANFs), which can be regarded as nonlinear filter generators over an extension field. One instance of the WG stream cipher family called WG-29 [18] was submitted to the ECRYPT Stream Cipher (eSTREAM) project [8] and entered the second phase in 2005. Among more than 20 submissions, WG-29 is the only candidate that has mathematically proven randomness properties such as ideal two-level autocorrelation, balance, long period, ideal tuple distribution, and exact linear complexity [6]. Those randomness properties are paramount for protecting communication systems from cryptanalysis. Moreover, sequences with ideal two-level autocorrelation are very effective in combating channel noise since the autocorrelation reaches its maximum value after one period, thereby facilitating the synchronization between a transmitter and a receiver. Besides the stream cipher WG-29, other instances of the WG stream cipher family have been proposed to secure RFID systems [15], resource-constrained smart devices [9] and 4G-LTE networks [10].
Thanks to the attractive randomness properties of the WG stream cipher family, its hardware implementations have also attracted a lot of attention. Nawaz and Gong [18] described a hardware architecture for implementing WG-29 using normal basis, which requires seven multipliers and one inversion over $\mathbb{F}_{2^{29}}$. Their hardware design has been further improved in [19] by eliminating one multiplier through signal reuse and replacing the inversion with an exponentiation. In [13], Krengel proposed an interleaved approach with precomputation which can achieve an 8-fold speed-up at the cost of $2^{29}$ bits of ROM in hardware. Lam et al. [14] presented the hardware design of the MOWG, a multi-bit output variant of the original WG cipher. The authors optimized the proposed hardware architecture through extensive signal reuse as well as pipelining with reuse techniques. Recently, El-Razouk et al. [7] proposed a novel hardware design for WG-29 that is based on the efficient computation of the trace of a product of two finite field elements with type-II optimal normal basis (ONB) representations. Their ASIC implementations can achieve an improvement of 40% in area, 39% in dynamic power consumption and 17% in speed, when compared to previous results in the literature. Motivated by El-Razouk et al.'s work in [7], we address in this paper the compact hardware implementation of WG-16 [10], an efficient instance of the WG stream cipher family. Due to the lack of Gaussian normal bases [17] over $\mathbb{F}_{2^{16}}$, the method for computing the trace of a product of two field elements proposed in [7] cannot be directly applied to WG-16. However, we demonstrate that by using the isomorphic tower field $\mathbb{F}_{(((2^2)^2)^2)^2}$ of $\mathbb{F}_{2^{16}}$ together with normal bases for all towerings, the trace property in [7] can be recovered.
By combining efficient tower field arithmetic and low-cost basis conversion matrices, we propose a compact hardware architecture as well as its pipelined design for WG-16. Our implementations on FPGA (i.e., Spartan-6) and ASIC (i.e., 65 nm CMOS technology) show that the pipelined WG-16 hardware core can run at a maximum frequency of 124 MHz and 552 MHz, at the cost of 478 slices and 12,031 gate equivalents (GEs) as well as 138 mW and 25.5 mW of dynamic power consumption on the respective target platforms.
The rest of this paper is organized as follows. Section 2 gives some notations and a brief description of the stream cipher WG-16. In Section 3, we describe efficient tower field arithmetic in $\mathbb{F}_{(((2^2)^2)^2)^2}$ and derive low-cost basis conversion matrices. Section 4 proposes a compact hardware architecture for WG-16 based on a newly proposed property of the trace function. In Section 5, we describe a pipelined design for the proposed compact hardware architecture of WG-16. The FPGA and ASIC implementations of the pipelined design are discussed and compared to previous work in Section 6. Finally, Section 7 concludes this contribution.
PRELIMINARIES
This section defines some notations that will be used to describe the stream cipher WG-16 and its hardware architecture throughout this paper.
• $\mathbb{F}_2 = \{0, 1\}$, the Galois field with two elements 0 and 1.
• $p(X) = X^{16} + X^{5} + X^{3} + X^{2} + 1$, a primitive polynomial of degree 16 over $\mathbb{F}_2$.
• $\mathbb{F}_{2^{16}}$, the extension field of $\mathbb{F}_2$ defined by the primitive polynomial $p(X)$ with $2^{16}$ elements. Let $\omega$ be a primitive element of $\mathbb{F}_{2^{16}}$ such that $p(\omega) = 0$.
• $\mathbb{F}_{(((2^2)^2)^2)^2}$, the isomorphic tower construction of $\mathbb{F}_{2^{16}}$.
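• $\mathrm{Tr}_m^{mn}(x) = x + x^{2^m} + x^{2^{2m}} + \cdots + x^{2^{m(n-1)}}$, the relative trace function from $\mathbb{F}_{2^{mn}} \to \mathbb{F}_{2^m}$. If $m$ is equal to 1, then $\mathrm{Tr}_1^{n}(\cdot)$ is just called the trace and denoted by $\mathrm{Tr}(\cdot)$.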
• $l(X) = X^{32} + X^{25} + X^{16} + X^{7} + \omega^{2743}$, a primitive polynomial of degree 32 over $\mathbb{F}_{2^{16}}$, which is used as the feedback polynomial of the LFSR.
• q(X) = X +X (i.e., a normal element) is used in this work.
• ⊕m, the bitwise XOR operator for two operands of m bits.
• m, the bitwise AND operator for two operands of m bits.
• ⊗m, the multiplication operator for two operands over F2m or its isomorphic fields.
Overview of the Stream Cipher WG-16
The stream cipher WG-16 [10] is a hardware-oriented keystream generator that consists of three main components as illustrated in Figure 1: a 32-stage LFSR, a WG-16 transformation/permutation module, and a finite state machine (FSM). The stream cipher WG-16 operates in three phases, namely the key/IV loading phase, the initialization phase, and the running phase, under the control of the FSM. During the key/IV loading phase, a 128-bit key and a 128-bit initialization vector (IV) are first loaded into the LFSR within 32 clock cycles through the pin DIN. After the key and IV have been loaded, the initialization phase is performed in the next 64 steps without any output. In the initialization phase, the input to the LFSR is the bitwise XOR of the linear feedback from the LFSR and the 16-bit intermediate value (i.e., a nonlinear feedback) from the WG-16 permutation module. The running phase starts from the 97th step, and one keystream bit is generated in each clock cycle. In the running phase, the only input to the LFSR is the linear feedback within the LFSR. The recurrence relations for updating the LFSR in the initialization and running phases are summarized below:
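As an illustration of the LFSR update, the following Python sketch models the linear recurrence implied by the feedback polynomial l(X), i.e., a_{k+32} = ω^2743 · a_k + a_{k+7} + a_{k+16} + a_{k+25}, using polynomial-basis arithmetic modulo p(X). It is a software model under stated assumptions, not the hardware design: the function names are hypothetical, and the nonlinear feedback of the initialization phase is simply passed in by the caller because the WG-16 permutation itself is not modelled here.

```python
# p(X) = X^16 + X^5 + X^3 + X^2 + 1, packed as a bit mask (bit i <-> X^i).
P_POLY = (1 << 16) | (1 << 5) | (1 << 3) | (1 << 2) | 1


def gf_mul(a, b):
    """Multiply two 16-bit elements of F_2[X]/(p(X)) (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 16:          # degree reached 16: reduce modulo p(X)
            a ^= P_POLY
    return r


def gf_pow(a, e):
    """Square-and-multiply exponentiation in the same field."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


OMEGA = 0b10                        # omega is the class of X, a root of p(X)
OMEGA_2743 = gf_pow(OMEGA, 2743)    # constant term of l(X)


def lfsr_step(state, nonlinear_feedback=0):
    """Clock the 32-stage LFSR once.

    state[0] is the oldest word a_k and state[31] the newest word a_{k+31}.
    In the running phase nonlinear_feedback stays 0; in the initialization
    phase the caller supplies the 16-bit WGP-16 output (not modelled here).
    """
    assert len(state) == 32
    new = (gf_mul(OMEGA_2743, state[0]) ^ state[7] ^ state[16] ^ state[25]
           ^ nonlinear_feedback)
    return state[1:] + [new]
```

Clocking `lfsr_step` with `nonlinear_feedback=0` corresponds to the running phase, while the initialization phase would pass the permutation output for each of its 64 steps.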
TOWER CONSTRUCTIONS OF $\mathbb{F}_{2^{16}}$ AND BASIS CONVERSION MATRICES
The hardware implementations of the stream cipher WG-16 involve finite field arithmetic (i.e., addition, multiplication, exponentiation and inversion) over $\mathbb{F}_{2^{16}}$. Since several exponentiations of the form $X^{2^\iota}$ ($\iota > 1$) are computed during the evaluation of the WG-16 transformation, it is natural to utilize a normal basis of $\mathbb{F}_{2^{16}}$ over $\mathbb{F}_2$ for efficient implementation. Although we can directly implement multiplications and inversions over $\mathbb{F}_{2^{16}}$ using the normal basis, the hardware and time complexities of the resulting implementation are high due to the lack of Gaussian normal bases [17] over $\mathbb{F}_{2^{16}}$. To address these issues, we employ the isomorphic tower construction of $\mathbb{F}_{2^{16}}$ to achieve better performance. In order to describe our choice of polynomial and normal bases at each level of the tower unambiguously, we use the field representations given in Table 1 as a reference.
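To obtain the tower construction of $\mathbb{F}_{2^{16}}$, we first construct $\mathbb{F}_{2^2}$ by using the irreducible polynomial $e(X)$ over $\mathbb{F}_2$, then construct $\mathbb{F}_{(2^2)^2}$ by using a certain irreducible polynomial $f(X)$ of degree 2 over $\mathbb{F}_{2^2}$, and then construct $\mathbb{F}_{((2^2)^2)^2}$ by using a certain irreducible polynomial $g(X)$ of degree 2 over $\mathbb{F}_{(2^2)^2}$. Finally, we construct $\mathbb{F}_{(((2^2)^2)^2)^2}$ by using a certain irreducible polynomial $h(X)$ of degree 2 over $\mathbb{F}_{((2^2)^2)^2}$.
Note that all individual field extensions have degree two, as illustrated below:
$$\mathbb{F}_2 \xrightarrow{\;e(X)\;} \mathbb{F}_{2^2} \xrightarrow{\;f(X)\;} \mathbb{F}_{(2^2)^2} \xrightarrow{\;g(X)\;} \mathbb{F}_{((2^2)^2)^2} \xrightarrow{\;h(X)\;} \mathbb{F}_{(((2^2)^2)^2)^2}$$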
Tower Construction with Normal Bases
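Since the efficiency of the arithmetic over $\mathbb{F}_{(((2^2)^2)^2)^2}$ is closely related to the selection of the irreducible polynomials as well as the bases for the towerings, we consider using normal bases for all towerings here, as illustrated in Table 2.
Arithmetic operations in $\mathbb{F}_{2^2}$.
Let $A = a_0\alpha + a_1\alpha^2$ and $B = b_0\alpha + b_1\alpha^2$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_2$. A multiplication $C = AB$ is computed as follows (see Figure 2(a)):
$$C = c_0\alpha + c_1\alpha^2, \qquad c_0 = (a_0 + a_1)(b_0 + b_1) + a_0 b_0, \qquad c_1 = (a_0 + a_1)(b_0 + b_1) + a_1 b_1.$$
For a non-zero element $A \in \mathbb{F}_{2^2}$, the square (i.e., the Frobenius mapping with respect to $\mathbb{F}_2$) of $A$ is calculated as follows (see Figure 2(b)):
$$A^2 = a_1\alpha + a_0\alpha^2.$$
Note that the inverse of $A \in \mathbb{F}_{2^2}$ is equivalent to the square. Moreover, the multiplications of $A \in \mathbb{F}_{2^2}$ by $\alpha$ and $\alpha^2$ are carried out as follows (see Figure 3):
$$\alpha A = a_1\alpha + (a_0 + a_1)\alpha^2, \qquad \alpha^2 A = (a_0 + a_1)\alpha + a_0\alpha^2.$$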
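The base-field operations can be modelled in a few lines of Python. The sketch below is illustrative: it represents A = a0·α + a1·α^2 as the bit pair (a0, a1), uses the relation α^2 + α + 1 = 0, and the function names (`m2`, `s2`, `mul_alpha`, `mul_alpha2`) are hypothetical labels rather than identifiers from the hardware design.

```python
# Elements of F_{2^2} in the normal basis {alpha, alpha^2}:
# the pair (a0, a1) stands for a0*alpha + a1*alpha^2.
ALPHA, ALPHA2, ONE = (1, 0), (0, 1), (1, 1)   # 1 = alpha + alpha^2


def m2(a, b):
    """Multiplication: three ANDs and four XORs."""
    a0, a1 = a
    b0, b1 = b
    s = (a0 ^ a1) & (b0 ^ b1)          # shared term (a0 + a1)(b0 + b1)
    return (s ^ (a0 & b0), s ^ (a1 & b1))


def s2(a):
    """Squaring is a coordinate swap; for non-zero a it is also the inverse."""
    return (a[1], a[0])


def mul_alpha(a):
    """Multiplication by the constant alpha (one XOR)."""
    a0, a1 = a
    return (a1, a0 ^ a1)


def mul_alpha2(a):
    """Multiplication by the constant alpha^2 (one XOR)."""
    a0, a1 = a
    return (a0 ^ a1, a0)
```

Because squaring and inversion reduce to wiring, only the multiplier and the constant multipliers consume gates at this level.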
Arithmetic operations in $\mathbb{F}_{(2^2)^2}$.
Let $A = a_0\beta + a_1\beta^4$ and $B = b_0\beta + b_1\beta^4$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_{2^2}$. A multiplication $C = AB$ in $\mathbb{F}_{(2^2)^2}$ is computed as follows (see Figure 4(a)):
For a non-zero element $A \in \mathbb{F}_{(2^2)^2}$, the square of $A$ is calculated as follows (see Figure 4(b)):
The Frobenius mapping of $A$ with respect to $\mathbb{F}_{2^2}$, which is the 4th power operation, is computed as follows:
$$A^4 = a_1\beta + a_0\beta^4.$$
Letting $A$ be a non-zero element in $\mathbb{F}_{(2^2)^2}$, the inverse of $A$, denoted by $I$, can be calculated by the Itoh-Tsujii algorithm (ITA) [12] as follows (see Figure 4(c)):
$$I = A^{-1} = (A \cdot A^4)^{-1} \cdot A^4, \qquad \text{where } A \cdot A^4 \in \mathbb{F}_{2^2}.$$
The multiplications of $A \in \mathbb{F}_{(2^2)^2}$ by $\lambda$, $\lambda^2$, $\beta$ and $\alpha\beta$ are carried out as follows (see Figure 5):
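Arithmetic operations in $\mathbb{F}_{((2^2)^2)^2}$.
Let $A = a_0\gamma + a_1\gamma^{16}$ and $B = b_0\gamma + b_1\gamma^{16}$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_{(2^2)^2}$. A multiplication $C = AB$ in $\mathbb{F}_{((2^2)^2)^2}$ is carried out as follows (see Figure 6(a)):
For a non-zero element $A \in \mathbb{F}_{((2^2)^2)^2}$, the square of $A$ is calculated as follows (see Figure 6(b)):
The Frobenius mapping of $A$ with respect to $\mathbb{F}_{2^4}$, which is the 16th power operation, is computed as follows:
$$A^{16} = a_1\gamma + a_0\gamma^{16}.$$
Letting $A$ be a non-zero element in $\mathbb{F}_{((2^2)^2)^2}$, the inverse of $A$, denoted by $I$, can be calculated by the ITA as follows (see Figure 6(c)):
$$I = A^{-1} = (A \cdot A^{16})^{-1} \cdot A^{16}, \qquad \text{where } A \cdot A^{16} \in \mathbb{F}_{(2^2)^2}.$$
The multiplication of $A \in \mathbb{F}_{((2^2)^2)^2}$ by $\mu$ is carried out as follows (see Figure 7):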
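Arithmetic operations in $\mathbb{F}_{(((2^2)^2)^2)^2}$.
Let $A = a_0\delta + a_1\delta^{256}$ and $B = b_0\delta + b_1\delta^{256}$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_{((2^2)^2)^2}$. A multiplication $C = AB$ in $\mathbb{F}_{(((2^2)^2)^2)^2}$ is computed as follows (see Figure 8(a)):
For a non-zero element $A \in \mathbb{F}_{(((2^2)^2)^2)^2}$, the square of $A$ is calculated as follows (see Figure 8(b)):
The Frobenius mapping of $A$ with respect to $\mathbb{F}_{2^8}$, which is the 256th power operation, is computed as follows:
$$A^{256} = a_1\delta + a_0\delta^{256}.$$
Letting $A$ be a non-zero element in $\mathbb{F}_{(((2^2)^2)^2)^2}$, the inverse of $A$, denoted by $I$, can be calculated by the ITA as follows (see Figure 8(c)):
$$I = A^{-1} = (A \cdot A^{256})^{-1} \cdot A^{256}, \qquad \text{where } A \cdot A^{256} \in \mathbb{F}_{((2^2)^2)^2}.$$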
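The recursive structure of the ITA inversion can be sketched in software as follows. This is a minimal model under explicit assumptions, not the paper's circuit: every level is assumed to be defined by a polynomial X^2 + X + ν with normal basis {θ, θ^q}, and each constant ν is found by a small search instead of being taken from Table 2, so the constants in `NU` and all function names are hypothetical. The inverse is computed as (A · A^q)^{-1} · A^q, where A^q swaps the two halves and A · A^q lies in the subfield.

```python
# An element of width w (2, 4, 8 or 16 bits) is an int whose low half is a0 and
# whose high half is a1, meaning a0*theta + a1*theta^q at that tower level.
NU = {2: 1}   # theta is a root of X^2 + X + NU[w]; NU[2] = 1 gives X^2 + X + 1


def split(a, w):
    h = w // 2
    return a & ((1 << h) - 1), a >> h


def join(a0, a1, w):
    return a0 | (a1 << (w // 2))


def mul(a, b, w):
    """Three sub-multiplications plus one multiplication by the level constant."""
    if w == 1:
        return a & b
    h = w // 2
    a0, a1 = split(a, w)
    b0, b1 = split(b, w)
    t = mul(NU[w], mul(a0 ^ a1, b0 ^ b1, h), h)
    return join(mul(a0, b0, h) ^ t, mul(a1, b1, h) ^ t, w)


# Assumed constants: the smallest nu for which X^2 + X + nu has no root in the
# subfield (hence is irreducible there). The paper's own choices may differ.
for w in (4, 8, 16):
    h = w // 2
    image = {mul(x, x, h) ^ x for x in range(1 << h)}
    NU[w] = min(v for v in range(1 << h) if v not in image)


def inv(a, w):
    """Itoh-Tsujii style inversion: A^-1 = (A * A^q)^-1 * A^q."""
    if w == 1:
        return a
    h = w // 2
    a0, a1 = split(a, w)
    s = a0 ^ a1
    norm = mul(a0, a1, h) ^ mul(NU[w], mul(s, s, h), h)   # A * A^q, in the subfield
    n_inv = inv(norm, h)                                   # recursive sub-inversion
    return join(mul(n_inv, a1, h), mul(n_inv, a0, h), w)   # scale A^q = (a1, a0)


ONE16 = 0xFFFF   # the element 1 has all coordinates equal to 1
```

At the top level the model performs one sub-multiplication and one sub-squaring, a constant multiplication, a sub-inversion, and two final sub-multiplications, which mirrors the operation order described for the I16 module in Section 5.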
Efficient Conversion Matrices.
Two matrices $M_{NT}$ and $M_{TN}$ are needed for converting elements between normal basis and tower field representations. As noticed by Nogami et al. in [20], these conversion matrices are easily found but they are not uniquely determined because the modular polynomials $e(X)$, $f(X)$, $g(X)$ and $h(X)$ have conjugate elements as zeros. In particular, efficient conversion matrices that lead to a small critical path delay are rare. We conduct an exhaustive search over 16 conjugate variants of the conversion matrix, and the best pair of $M_{NT}$ and $M_{TN}$ is shown below:
Note that the pair of $M_{NT}$ and $M_{TN}$ achieves $\mathrm{WT}(M_{NT}) + \mathrm{WT}(M_{TN}) = 92 + 100 = 192$, where $\mathrm{WT}(\cdot)$ counts the number of 1's in a matrix (i.e., the weight of a binary matrix). Moreover, the Hamming weights of the row vectors of $M_{NT}$ and $M_{TN}$ are at most 7 and 9, respectively. As a result, the critical path delays of implementing $M_{NT}$ and $M_{TN}$ using the tree structure [5] are $3T_X$ and $4T_X$, respectively, where $T_X$ denotes the delay of an XOR gate.
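The relation between row weight and critical path can be made concrete with a short Python sketch. It assumes a balanced tree of 2-input XOR gates, so that a row of weight w costs ceil(log2(w)) gate delays. The 4x4 matrix `TOY` is a made-up example used only to exercise the helpers; it is not one of the conversion matrices.

```python
def xor_tree_depth(fan_in):
    """Levels of 2-input XOR gates needed to sum `fan_in` bits in a balanced tree."""
    return max(fan_in - 1, 0).bit_length()


def popcount(x):
    return bin(x).count("1")


def weight(rows):
    """WT(M): total number of 1's, with each matrix row stored as a bit mask."""
    return sum(popcount(r) for r in rows)


def critical_path(rows):
    """Critical path in XOR delays: set by the heaviest row."""
    return max(xor_tree_depth(popcount(r)) for r in rows)


def apply_matrix(rows, x):
    """Binary matrix-vector product: output bit i is the parity of rows[i] AND x."""
    y = 0
    for i, r in enumerate(rows):
        y |= (popcount(r & x) & 1) << i
    return y


TOY = [0b1011, 0b0110, 0b0001, 0b1111]   # hypothetical 4x4 binary matrix
```

With maximum row weights of 7 and 9, `xor_tree_depth` gives 3 and 4, matching the delays of 3T_X and 4T_X stated above.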
Hardware and Time Complexities of Tower Construction
We summarize the hardware and time complexities of the building blocks in tower construction with normal bases in Table 3 , where NX (resp. TX) and NA (resp. TA) denote the number (resp. the delay) of XOR gates and AND gates, respectively.
The tower construction described in Section 3 allows a hardware architecture with a highly regular structure, having almost identical basic building blocks for each layer. This high level of regularity allows accurate prediction of area complexities for basic building blocks on higher levels of the tower field, based on results obtained in the base field. If we refer to Table 3 and compare the area complexities of multipliers M2 and M4, we can observe that M4 contains three M2 blocks (so 12 XOR gates and 9 AND gates), an Mα block (one XOR gate) and four 2-bit XOR gates, adding up to a total of 21 XOR gates and 9 AND gates in multiplier M4.
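The counting argument for M4 generalizes to a simple recurrence, sketched below in Python. The helper names are hypothetical, and the XOR cost of the constant multiplier is left as a parameter because only the Mα cost (one XOR gate) is quoted in the text; the costs of the constant multipliers at higher levels would have to be taken from Table 3.

```python
def multiplier_cost(sub_cost, const_mult_xors, sub_width):
    """(XOR, AND) gate count of a multiplier built from three sub-multipliers,
    one constant multiplier and four sub_width-bit XOR operations."""
    sub_xor, sub_and = sub_cost
    return (3 * sub_xor + const_mult_xors + 4 * sub_width, 3 * sub_and)


def and_gates(levels):
    """AND gates after `levels` quadratic extensions (M2 -> 1, ..., M16 -> 4):
    only the sub-multipliers contribute AND gates, so the count triples per level."""
    return 3 ** levels


M2 = (4, 3)   # four XOR and three AND gates, read off the c0/c1 formulas
M4 = multiplier_cost(M2, const_mult_xors=1, sub_width=2)
```

Here `M4` evaluates to 21 XOR gates and 9 AND gates, and `and_gates(4)` gives 81 AND gates per M16, consistent with the figure of 405 AND gates quoted for five M16 multipliers in Section 4.2.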
A COMPACT HARDWARE ARCHITECTURE OF THE WG-16 STREAM CIPHER
In this section, we describe a compact hardware architecture for the WG-16 stream cipher.
Properties of the Trace Function in Tower Field Representation
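In [7], El-Razouk et al. proved that when the elements in $\mathbb{F}_{2^m}$ are represented in a type-II ONB [11], the trace of the product of any two elements can be efficiently computed as the modulo-2 sum of the coordinates of the bitwise AND of the two elements. Despite the lack of Gaussian normal bases over $\mathbb{F}_{2^{16}}$, we show that this property still holds in the isomorphic tower field $\mathbb{F}_{(((2^2)^2)^2)^2}$ constructed in Section 3.1.
Lemma 1. Given the tower construction in Section 3.1, we have the following basic facts:
• For any two elements $A = a_0\delta + a_1\delta^{256}$ and $B = b_0\delta + b_1\delta^{256}$ in $\mathbb{F}_{(((2^2)^2)^2)^2}$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_{((2^2)^2)^2}$, we have $\mathrm{Tr}_8^{16}(AB) = (a_0 \otimes_8 b_0) \oplus_8 (a_1 \otimes_8 b_1)$.
• For any two elements $A = a_0\gamma + a_1\gamma^{16}$ and $B = b_0\gamma + b_1\gamma^{16}$ in $\mathbb{F}_{((2^2)^2)^2}$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_{(2^2)^2}$, we have $\mathrm{Tr}_4^{8}(AB) = (a_0 \otimes_4 b_0) \oplus_4 (a_1 \otimes_4 b_1)$.
• For any two elements $A = a_0\beta + a_1\beta^4$ and $B = b_0\beta + b_1\beta^4$ in $\mathbb{F}_{(2^2)^2}$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_{2^2}$, we have $\mathrm{Tr}_2^{4}(AB) = (a_0 \otimes_2 b_0) \oplus_2 (a_1 \otimes_2 b_1)$.
• For any two elements $A = a_0\alpha + a_1\alpha^2$ and $B = b_0\alpha + b_1\alpha^2$ in $\mathbb{F}_{2^2}$, where $a_0, a_1, b_0, b_1 \in \mathbb{F}_2$, we have $\mathrm{Tr}_1^{2}(AB) = (a_0 \otimes_1 b_0) \oplus_1 (a_1 \otimes_1 b_1)$.
Proof. Noting that each normal basis element has relative trace 1 over the subfield one level below (in particular, $\mathrm{Tr}_1^{2}(\alpha) = \mathrm{Tr}_1^{2}(\alpha^2) = 1$) and using the multiplication formulae derived in Section 3.1, the results follow.
Proposition 1. Given the tower construction in Section 3.1, the trace of the product of any two elements $U = (u_0, u_1, \ldots, u_{15})$ and $V = (v_0, v_1, \ldots, v_{15})$ in $\mathbb{F}_{(((2^2)^2)^2)^2}$ can be computed as the modulo-2 sum of the coordinates of the bitwise AND of $U$ and $V$, i.e.,
$$\mathrm{Tr}(UV) = \bigoplus_{i=0}^{15} u_i v_i.$$
Proof. Write $U$ in terms of its sub-blocks at each level of the tower, where $u_{j..j+7} \in \mathbb{F}_{((2^2)^2)^2}$ for $j = 0, 8$, $u_{j..j+3} \in \mathbb{F}_{(2^2)^2}$ for $j = 0, 4, 8, 12$, $u_{j..j+1} \in \mathbb{F}_{2^2}$ for $j = 0, 2, 4, \ldots, 14$ and $u_j \in \mathbb{F}_2$ for $j = 0, \ldots, 15$, and $V$ has a similar representation. The trace of the product of $U$ and $V$ can then be computed as follows:
$$\begin{aligned}
\mathrm{Tr}(UV) &= \mathrm{Tr}_1^{8}\Big((u_{0..7} \otimes_8 v_{0..7}) \oplus_8 (u_{8..15} \otimes_8 v_{8..15})\Big) \\
&= \mathrm{Tr}_1^{4}\Big(\bigoplus_{j \in \{0,4,8,12\}} u_{j..j+3} \otimes_4 v_{j..j+3}\Big) \\
&= \mathrm{Tr}_1^{2}\Big(\bigoplus_{j \in \{0,2,\ldots,14\}} u_{j..j+1} \otimes_2 v_{j..j+1}\Big) \\
&= \bigoplus_{j=0}^{15} u_j v_j,
\end{aligned}$$
where each equation follows from one basic fact in Lemma 1.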
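Proposition 1 can be checked numerically on a small software model of the tower. The Python sketch below is illustrative only: it assumes that every level is built from a polynomial of the form X^2 + X + ν with a normal basis {θ, θ^q}, and it picks each ν by search rather than using the polynomials of Table 2, so the constants in `NU` and the helper names are hypothetical. The trace is evaluated independently as X + X^2 + ... + X^(2^15) with tower multiplications and then compared against the parity of the bitwise AND.

```python
# An element of width w is an int: low half = a0, high half = a1,
# meaning a0*theta + a1*theta^q in the normal basis of that level.
NU = {2: 1}   # theta is a root of X^2 + X + NU[w]


def mul(a, b, w):
    """Tower multiplication with three recursive sub-multiplications."""
    if w == 1:
        return a & b
    h = w // 2
    mask = (1 << h) - 1
    a0, a1, b0, b1 = a & mask, a >> h, b & mask, b >> h
    t = mul(NU[w], mul(a0 ^ a1, b0 ^ b1, h), h)
    return (mul(a0, b0, h) ^ t) | ((mul(a1, b1, h) ^ t) << h)


# Assumed constants: smallest nu such that X^2 + X + nu has no root in the
# subfield, which makes it irreducible there. The paper's choices may differ.
for w in (4, 8, 16):
    h = w // 2
    image = {mul(x, x, h) ^ x for x in range(1 << h)}
    NU[w] = min(v for v in range(1 << h) if v not in image)


def trace(x):
    """Absolute trace x + x^2 + ... + x^(2^15), computed in the tower field.

    The result lies in F_2, which the tower encodes as 0x0000 (for 0) or
    0xFFFF (for 1), since the element 1 has all coordinates equal to 1.
    """
    acc, y = 0, x
    for _ in range(16):
        acc ^= y
        y = mul(y, y, 16)
    return acc


def and_parity(u, v):
    """Modulo-2 sum of the coordinates of the bitwise AND of u and v."""
    return bin(u & v).count("1") & 1
```

For any 16-bit `u` and `v`, `trace(mul(u, v, 16))` is expected to equal `0xFFFF * and_parity(u, v)`, so in this model the sixteen-term trace sum collapses to 16 AND gates followed by an XOR tree.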
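Corollary 1. Given the tower construction in Section 3.1, for any elements $X = (x_0, x_1, \ldots, x_{15})$, $U = (u_0, u_1, \ldots, u_{15})$ and $V = (v_0, v_1, \ldots, v_{15})$ in $\mathbb{F}_{(((2^2)^2)^2)^2}$ we have $\mathrm{Tr}(X) = \bigoplus_{i=0}^{15} x_i$ and $\mathrm{Tr}((UV)^{2^w}) = \bigoplus_{i=0}^{15} u_i v_i$ for any integer $w \ge 0$.
Proof. Given the tower construction in Section 3.1, the element $1 \in \mathbb{F}_{2^{16}}$ can be denoted by $(1, 1, \ldots, 1)$. Therefore, we immediately obtain $\mathrm{Tr}(X) = \bigoplus_{i=0}^{15} x_i$ by setting $U = X$ and $V = 1$ in Proposition 1. Noting that $\mathrm{Tr}(X^{2^w}) = \mathrm{Tr}(X)$ for any $X \in \mathbb{F}_{(((2^2)^2)^2)^2}$, we obtain the following result by setting $X = UV$ and using Proposition 1:
$$\mathrm{Tr}((UV)^{2^w}) = \mathrm{Tr}(UV) = \bigoplus_{i=0}^{15} u_i v_i.$$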
Corollary 2. Given the tower construction in Section 3.1, for any elements U, V and W in F (((2 2 ) 2 ) 2 ) 2 we have
Proof. The proof is the same as that of Corollary 2 in [7].
An Optimized Hardware Architecture of the WGT-16($X^d$) Module
Based on the Proposition and the two Corollaries in Section 4.1, the WG-16 transformation WGT-16($X^d$) can be computed as follows:
, where Y = X 1057 ⊕16 1 = X 
10 , which requires five multipliers M16 with a hardware complexity of NX = 1, 560 and NA = 405 and a critical path delay of 75TX + 5TA. It is not difficult to find that the second method is more efficient in terms of both hardware and time complexities. Therefore, the WG-16 transformation WGT-16(X d ) can be computed using four multipliers in total, where two multipliers are utilized for generating Y , one for computing Y 
[Figure 9: Integrated hardware architecture of the WGT-16($X^d$)/WGP-16($X^d$) module.]
An Integrated Hardware Architecture of the WGP-16($X^d$) Module
In the key/IV initialization phase, a 16-bit output from WGP-16($X^d$) needs to be used as a nonlinear feedback to the LFSR, where the WG-16 permutation WGP-16($X^d$) can be computed as follows:
where Y = X 1057 ⊕161. Note that Y Figure 9 and a more detailed description of the pipelined design will be presented in Section 5.
Hardware and Time Complexities
The WG-16 hardware core is composed of three components: a) an FSM; b) a 32-stage LFSR; and c) an integrated WGT-16($X^d$)/WGP-16($X^d$) module, as illustrated in Figure 9. Let $N_R$, $N_A$, $N_X$, $N_O$, and $N_I$ denote the number of Registers, AND gates, XOR gates, OR gates, and Inverters, respectively. We summarize the hardware complexity for implementing the WG-16 stream cipher in Table 4. Let $T_R$, $T_A$, $T_X$, $T_O$, and $T_I$ denote the delay of a Register, an AND gate, an XOR gate, an OR gate, and an Inverter, respectively. We first notice that the delay through the LFSR is significantly smaller than that through the integrated WGT-16($X^d$)/WGP-16($X^d$) module, because the LFSR contains fewer multipliers than the WG transformation. The following lemma characterizes the critical path delays of the initialization and running phases of the WG-16 hardware core.
Lemma 2. The longest paths of the initialization and running phases of the WG-16 hardware core are respectively given by $T_{Init} = 88T_X + 8T_A + T_R + T_O + 2T_I$ and $T_{Run} = 94T_X + 9T_A + T_R + T_O + 2T_I$.
Proof. The lemma can be easily obtained by adding the delays of the components on the longest paths during the initialization and running phases.
From the lemma, we can see that the maximum delay through the WG-16 hardware core is $T_{Run} = 94T_X + 9T_A + T_R + T_O + 2T_I$. Since the critical path of this WG-16 hardware core is long, we present a pipelined design in the next section.
A PIPELINED DESIGN OF THE WG-16 STREAM CIPHER
In this section, we describe a pipelined design of the WG-16 stream cipher in order to achieve a higher throughput.
A Pipelined Hardware Architecture
For creating a pipelined design, the integrated hardware architecture WGP T in Figure 9 has been decomposed into two submodules module A and module B, as shown in Figure 10 . While the module A contains the common computational components that are shared by the initialization and running phase and outputs the values Y 
Analysis of Basic Building Blocks
In order to determine appropriate pipeline stages, all the basic building blocks (see Section 3) for performing the tower field arithmetic have been implemented as combinatorial circuits on the target FPGA and ASIC platforms. The area and delay of the basic building blocks on the FPGA and ASIC platforms are summarized in Table 5.
The FPGA device (i.e., Spartan-6 XC6SLX9) used in our implementation features 6-input and 2-output look-up tables (LUTs), each of which can implement any 6-input Boolean function. Recall from Section 3 that the two output bits $c_0$ and $c_1$ of M2, namely
$$c_0 = (a_0 + a_1)(b_0 + b_1) + a_0 b_0, \qquad c_1 = (a_0 + a_1)(b_0 + b_1) + a_1 b_1,$$
are 4-input Boolean functions, computed on the same values of the inputs $a_0$, $a_1$, $b_0$ and $b_1$. Hence, the M2 multiplication can be realized on one LUT, using both outputs. In M4
Design of Pipeline Architecture
Based on the implementation of basic building blocks (see Table 5 ), it is obvious that using the inversion module I16 inside a pipeline stage is not the best option. Hence, the inversion module I16 (see Figure 8 (c)) has been implemented using four pipeline stages: 1) the initial multiplication module M8 and squaring module S8 in parallel; 2) the Mμ module; 3) the inversion I8 module; and 4) the last two multiplication modules M8 in parallel. We shall refer to this approach as pipelining at the I8 level. Similarly, we refer to pipelining at M8 level for M16 module (see Figure 8(a) ) implemented with inter-stage registers inserted between three parallel M8 modules and the Mμ module.
By inspecting the integrated hardware architecture WGP T in Figure 9, we identify the critical path for module A to be the data-path from $X$ to $Y^{-1}$. Based on implementation results of this data-path with a few different pipeline register placements as well as the previous discussion, the design of the module A pipeline narrows down to two options: pipelining the submodule at M16/I8 level, which results in a 7-stage pipeline, and pipelining at M8/I8 level, which requires two additional pipeline stages. Note that the initial exponentiations X In the latter case of M8/I8 pipelining (see Figure 11), the two multiplications are carried out over four pipeline stages. In both cases, the computation of $Y^{-1}$ is implemented at I8 level and requires four pipeline stages. The resulting pipelined design of module A at M8/I8 level is illustrated in Figure 11, where the vertical lines represent pipeline-stage borders. The three grey blocks demonstrate the pipelined design for the M16 and I16 modules (see Figures 8(a) and 8(c)) as explained before. For the M16/I8 level pipelining, one can omit the two pipeline borders inside the two M16 modules in Figure 11.
Regarding the pipeline design of module B, we notice that two of the four multipliers belong to module B and can be reused to serially compute WGP-16($X^d$). To conduct the serial computation of WGP-16($X^d$), six additional multiplexers MUX$_i$ ($i = 1, 2, \ldots, 6$) and three 16-bit registers Reg$_i$ ($i = 1, 2, 3$) are introduced into the hardware architecture of the WGT-16($X^d$) module (see Figure 9). The outputs of the six multiplexers and the updated states of the three registers are summarized in Table 6.
Module B is designed by employing a two-stage pipeline as shown in Figure 12. While the solid lines represent the part of circuit used for the common computations of WGT-16
Finite State Machine (FSM)
The FSM takes as inputs the 1-bit signals clk and rst and generates four control signals, denoted by lfsr_en, init, load, and sel. The control signals init and load determine which of the three phases the WG-16 hardware core is in, namely the key/IV loading phase, the initialization phase, or the running phase. The control signal lfsr_en is used to clock the LFSR to accommodate the serial computation of WGP-16($X^d$). In the initialization phase, a new WGP-16($X^d$) value is available once every $P + 4$ clock cycles, where $P$ equals the number of pipeline stages in module A. Since every WGP-16($X^d$) value (except the first one) is computed from the previous WGP-16($X^d$) value XORed with the previous LFSR feedback, the full potential of the pipeline cannot be used during the initialization phase, which results in a throughput of
bits/clock in this phase (i.e., there are $P$ idle cycles between two consecutive WGP-16($X^d$) computations). Two binary counters are used to keep track of the number of times the LFSR has been clocked and the number of idle clock cycles. The aforementioned 1-bit control signal sel is used to choose the correct operands for the computation of WGP-16($X^d$) (see Table 6). The running phase begins after $32 + 64(P + 4)$ clock cycles. Note that there are $P + 2$ additional idle cycles before the first bit of the keystream is available. Beyond this point, the core outputs a new keystream bit every clock cycle.
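The start-up latency implied by these cycle counts is easy to tabulate. The Python sketch below simply evaluates the expressions from the text, 32 + 64(P + 4) cycles before the running phase and P + 2 further idle cycles before the first keystream bit, for the two pipeline depths of module A discussed in Section 5.3 (P = 7 for M16/I8 and P = 9 for M8/I8). The function names are hypothetical.

```python
def cycles_before_running(p_stages, load_cycles=32, init_steps=64):
    """Clock cycles spent in key/IV loading plus initialization.

    Each of the 64 initialization steps takes P + 4 cycles, because a new
    WGP-16 value is only available once every P + 4 clock cycles.
    """
    return load_cycles + init_steps * (p_stages + 4)


def cycles_before_first_bit(p_stages):
    """Adds the P + 2 idle cycles needed to fill the pipeline for the first bit."""
    return cycles_before_running(p_stages) + p_stages + 2


for name, p in (("M16/I8", 7), ("M8/I8", 9)):
    print(name, cycles_before_running(p), cycles_before_first_bit(p))
```

The shorter M16/I8 pipeline therefore saves 128 cycles of initialization compared with M8/I8, which is the trade-off against clock speed discussed in Section 6.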
The integrated hardware architecture WGP T, which implements both WGP-16($X^d$) and WGT-16($X^d$) (see Figure 9), is basically a $P + 4$ stage pipeline. To reuse the two multipliers in module B to compute WGP-16($X^d$), we simply feed the pipeline with the same input $X$ three times, each with the appropriate value of the control signal sel, as described in the previous paragraph. In our pipelined design, we chose to pipeline at the level of M8 since this gave us the best tradeoff between clock speed, length of the initialization phase, and area. Pipelining at a finer granularity (e.g., M4) would double the length of the initialization phase with only a small increase in clock speed. Another issue with the module B pipelining is that pipelining at a lower level not only increases the number of pipeline stages (and hence the number of idle clock cycles), but also complicates the design of the FSM.
IMPLEMENTATION RESULTS AND COMPARISONS
In this section, we report the FPGA and ASIC implementation results of the proposed pipelined hardware architecture of the WG-16 stream cipher and compare our results with previous implementations of other instances of the WG stream cipher family. Our FPGA area and speed results are for the Xilinx Spartan-6 FPGA device XC6SLX9, using the Xilinx Synthesis Tool (XST) for synthesis and ISE for implementation [24]. All implementation results, including flip-flops, look-up tables, area (slices), speed (maximum frequency), and dynamic power consumption, are obtained after the place-and-route phase, and the dynamic power consumption is recorded at a frequency of 124 MHz. Moreover, we use the "-Power" option in the Mapping and Place-and-Route phases to further reduce the power consumption.
Our ASIC implementations provide area (GEs), speed (maximum frequency), and power consumption results for the 65 nm CMOS technology using Synopsys Design Compiler for synthesis [22] and Cadence SoC Encounter to complete the Place-and-Route phase. The area and speed results are obtained from the SoC Encounter's accurate area and speed reports after Place-and-Route phase. Furthermore, the dynamic power consumption is evaluated under the optimal frequency for the pipelined design. Table 7 presents the speed, area, and dynamic power consumption results for both FPGA and ASIC implementations.
The relative performance results of the two pipelined designs are similar for both FPGA and ASIC platforms as shown in Table 7 . While pipelining at M16/I8 level has a shorter pipeline and needs a smaller number of clock cycles for the initialization phase, the dynamic power consumption is higher, when compared to the design pipelined at M8/I8 level. The actual choice of pipeline design highly depends on the requirements of practical applications.
In comparison with the large instance WG-29 [19], the stream cipher WG-16 is more efficient in terms of throughput, area, and dynamic power consumption, without decreasing the security level [16]. Therefore, the stream cipher WG-16 makes a good trade-off between security and performance among the instances of the WG stream cipher family.
CONCLUSION
In this paper, we propose a compact hardware architecture and its pipelined version for the stream cipher WG-16 based on the combination of efficient tower field arithmetic over $\mathbb{F}_{(((2^2)^2)^2)^2}$ and a newly discovered property of the trace function in tower field representation. Various formulae for performing efficient arithmetic over $\mathbb{F}_{(((2^2)^2)^2)^2}$ have been derived, and low-cost basis conversion matrices have been found to conduct fast conversion of a finite field element between $\mathbb{F}_{2^{16}}$ and its isomorphic tower field $\mathbb{F}_{(((2^2)^2)^2)^2}$. Our FPGA and ASIC implementation results show that the WG-16 hardware core can achieve a throughput of 124 Mbit/s and 552 Mbit/s, at the cost of 478 slices and 12,031 GEs in hardware and 138 mW and 25.5 mW of dynamic power consumption, respectively. Based on these results, the stream cipher WG-16 is a competitive candidate for securing pervasive digital communication and computing systems.
