Abstract-In this paper, we propose a high-throughput pipeline architecture for the stream cipher ZUC, which has been included in the security portfolio of 3GPP LTE-Advanced. In the literature, the scheme with the highest throughput implements only the working stage of ZUC. The schemes that implement ZUC completely achieve a much lower throughput, since a self-feedback loop in the critical path significantly reduces the operating frequency. We design a mixed two-stage pipeline architecture that not only implements ZUC completely but also significantly raises the throughput. We have implemented our architecture on FPGAs and ASICs. On the FPGA platform, the new architecture increases the throughput by 45% compared with the latest work and also saves nearly 12% of hardware resources. In 65 nm ASIC technology, the throughput of the new design reaches up to 80 Gbps, which is 2.7 times faster than the fastest design in the literature, while saving at least 40% of hardware resources. Compared with the fastest commercial design, the new architecture doubles the throughput. To the best of our knowledge, these evaluation results are the best reported so far.
I. INTRODUCTION
ZUC [1] is a word-oriented stream cipher that consists of two stages: the initialization stage and the working stage. ZUC has three logical layers. The top layer is a linear feedback shift register (LFSR) with 16 cells, the middle layer is the bit reorganization, and the bottom layer is a nonlinear function F. In the initialization stage, the LFSR is loaded using a 128-bit key, a 128-bit IV and a 240-bit constant string, and during the first 32 iterations the output of the FSM is added to the feedback loop for the LFSR update [2]. After the first 32 iterations, ZUC moves into the working stage and outputs 32 bits of key per iteration.
The throughput of a hardware implementation of ZUC is determined by the ratio of the operating frequency to the number of clock cycles needed to generate each 32-bit key word; we use T to denote this number of clock cycles. To obtain a high throughput, we should reduce T or increase the operating frequency. Recently proposed works often use T = 1, which means that the LFSR is updated every clock cycle so that one 32-bit key word is output every clock cycle.
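As a rough illustration of this relation, the following Python sketch computes the throughput as 32 · f / T. The function name `throughput_bps` and the 200 MHz clock are assumptions chosen only for illustration; they are not taken from any of the cited designs.

```python
def throughput_bps(freq_hz: float, cycles_per_word: int, word_bits: int = 32) -> float:
    """Keystream throughput when one word_bits-wide word is produced
    every cycles_per_word clock cycles (the quantity T in the text)."""
    return word_bits * freq_hz / cycles_per_word


# Hypothetical 200 MHz clock, used only to show the effect of T.
print(throughput_bps(200e6, 1) / 1e9, "Gbps with T = 1")
print(throughput_bps(200e6, 2) / 1e9, "Gbps with T = 2")
```

At a fixed clock, doubling T halves the throughput, which is why designs with T = 1 then concentrate on raising the operating frequency.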
The operating frequency is determined by the critical path in ZUC. In hardware implementations, the critical path is determined by the update routine of the LFSR. This routine employs a series of modulo $2^{31} - 1$ multiplications and additions, and the modular adder is a time-consuming and resource-consuming component, which makes this data path much longer than the others.
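A minimal Python sketch of this update routine is given below. The feedback coefficients are taken from the public ZUC specification rather than from this section, and the helper names (`rotl31`, `add_mod_p`, `lfsr_feedback`) are hypothetical. The sketch relies on the fact that multiplying by $2^k$ modulo $2^{31} - 1$ is a k-bit rotation of a 31-bit word, so only the additions need real adders. The zero check on the result is left out.

```python
P = (1 << 31) - 1  # modulus 2^31 - 1


def rotl31(x: int, k: int) -> int:
    """Multiply x by 2^k modulo 2^31 - 1 as a 31-bit left rotation."""
    return ((x << k) | (x >> (31 - k))) & P


def add_mod_p(a: int, b: int) -> int:
    """Add two 31-bit values modulo 2^31 - 1 by folding the carry back in."""
    s = a + b
    return (s & P) + (s >> 31)


def lfsr_feedback(s: list) -> int:
    """Feedback value of the 16-cell LFSR (coefficients assumed from the
    ZUC specification): 2^15*s15 + 2^17*s13 + 2^21*s10 + 2^20*s4 + (2^8 + 1)*s0."""
    terms = [
        rotl31(s[15], 15),
        rotl31(s[13], 17),
        rotl31(s[10], 21),
        rotl31(s[4], 20),
        rotl31(s[0], 8),
        s[0],
    ]
    v = 0
    for t in terms:  # five chained modular additions: the long data path
        v = add_mod_p(v, t)
    return v
```

Chaining the additions one after another, as in the loop above, is what makes this path long in a direct hardware mapping.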
In the literature, many works try to shorten this path to increase the throughput of ZUC. The design with the highest throughput [3] implements only the working stage of ZUC, which makes it unsuitable for practical applications. The scheme proposed at INDOCRYPT 2011 by Gupta et al. [2] implements both stages of ZUC; however, the self-feedback loop in its critical path significantly reduces the operating frequency of the architecture.
As the critical path in the initialization stage is longer than that in the working stage, the throughput of the previous works [4]-[6], which include both stages, is much lower than that of the work [3], which includes the working stage only.
Contribution.
Our primary contribution is a novel mixed two-stage pipeline architecture of ZUC that considerably increases the throughput of ZUC in hardware. On the FPGA platform, the new architecture increases the throughput by 45% compared with the latest work [7] and saves nearly 12% of hardware resources. On the ASIC platform, our new architecture shortens the critical path by two-thirds compared with the work [6].
To verify the correctness of our mixed two-stage pipeline architecture, we have implemented it on Xilinx Virtex-5 and Virtex-6 FPGAs, achieving a throughput of 7.9 Gbps with 350 slices and 11.2 Gbps with 328 slices, respectively. Compared with existing pipelined and non-pipelined architectures, these are, to our knowledge, the best results reported so far on Xilinx FPGAs.
We also implement the design on the 65 nm ASIC platform. As the results show, our design significantly improves the performance of ZUC on the ASIC platform compared with both the academic designs and the commercial designs.
• carry = 1, set v = a + b + 1.
• carry = 0, set v = a + b.
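The two cases above can be sketched in Python as follows. The sketch assumes that a and b are 31-bit operands, that `carry` is the carry-out of their 31-bit sum, and that v is truncated to 31 bits; the function name `add_mod_p` is hypothetical.

```python
P = (1 << 31) - 1  # modulus 2^31 - 1, also the 31-bit mask


def add_mod_p(a: int, b: int) -> int:
    """Modulo 2^31 - 1 addition with an end-around carry."""
    s = a + b
    carry = s >> 31      # carry-out of the 31-bit addition
    low = s & P          # low 31 bits of a + b
    if carry:
        return low + 1   # carry = 1: v = a + b + 1 (31-bit result)
    return low           # carry = 0: v = a + b


print(add_mod_p(5, 7))       # no carry
print(add_mod_p(P - 1, 2))   # carry is folded back into bit 0
```

Note that the result may be the all-ones word $2^{31} - 1$, which represents the same residue as 0.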
1) Method 1:
This method is used in THREE-ZUC, and the architecture is shown in Fig. 1. It is a direct way to implement the modulo $2^{31} - 1$ addition: two 31-bit adders are concatenated directly, so the delay of this method is that of two 31-bit adders.
2) Method 2:
Motivated by the long delay of Method 1, Liu et al. [3] proposed this method to shorten it. A, B and Carry_0 = 1 are the inputs of one adder, which calculates A + B + 1, while A, B and Carry_0 = 0 are the inputs of another adder, which calculates A + B. A + B and A + B + 1 are thus computed at the same time, and the final result is selected by the carry bit of A + B. The delay of this method is lower than that of Method 1. The architecture of this method is shown in Fig. 1.
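The difference between Method 1 and Method 2 can be sketched in Python as below. This is a behavioural model only, with hypothetical function names; in hardware the two sums of Method 2 are produced by two adders working in parallel, which the sequential code cannot show.

```python
MASK31 = (1 << 31) - 1


def add_method1(a: int, b: int) -> int:
    """Method 1: two 31-bit adders in series."""
    first = a + b                        # first adder
    carry = first >> 31
    return (first & MASK31) + carry      # second adder folds the carry back


def add_method2(a: int, b: int) -> int:
    """Method 2: two adders side by side, result chosen by a multiplexer."""
    sum0 = a + b                         # adder with Carry_0 = 0
    sum1 = a + b + 1                     # adder with Carry_0 = 1
    carry = sum0 >> 31                   # carry bit of A + B drives the select
    return (sum1 if carry else sum0) & MASK31


samples = [(0, 0), (5, 7), (MASK31, 1), (MASK31 - 1, 2), (MASK31, MASK31)]
print(all(add_method1(a, b) == add_method2(a, b) for a, b in samples))
```

Both functions return the same 31-bit word; Method 2 trades a second adder and a multiplexer for a delay of roughly one adder instead of two.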
3) Method 3:
Another adder that is widely used in hardware design is the carry save adder (CSA). It is simply a parallel ensemble of k full adders without any horizontal connection [8]. When adding together three or more numbers, using a CSA followed by a carry propagate adder (CPA) is faster than using two CPAs. For example, to calculate (A + B + C) (mod $2^{31} - 1$), first use equations (1) and (2) to produce two integers, Carry and Sum, and then use a CPA to obtain the final result.
It is straightforward to see that A + B + C (mod $2^{31} - 1$) is equal to (Sum + (Carry ≪ 1)) (mod $2^{31} - 1$). Consequently, using this CSA architecture to implement the modulo $2^{31} - 1$ addition is very efficient: the total delay of this method is shorter than that of the previous methods, and fewer hardware resources are needed.
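A small Python sketch of this identity follows. Equations (1) and (2) are not reproduced in this section, so the sketch assumes that they are the usual bitwise full-adder equations (Sum as the XOR of the three operands, Carry as their bitwise majority); the function names are hypothetical. The shift Carry ≪ 1 is modelled as a 1-bit rotation of a 31-bit word, which is equivalent modulo $2^{31} - 1$.

```python
P = (1 << 31) - 1


def rotl31(x: int, k: int) -> int:
    """31-bit left rotation, i.e. multiplication by 2^k modulo 2^31 - 1."""
    return ((x << k) | (x >> (31 - k))) & P


def csa(a: int, b: int, c: int) -> tuple:
    """One carry-save stage: 31 independent full adders, no carry chain."""
    sum_ = a ^ b ^ c                         # assumed form of the Sum equation
    carry = (a & b) | (a & c) | (b & c)      # assumed form of the Carry equation
    return rotl31(carry, 1), sum_            # Carry << 1 reduced modulo 2^31 - 1


def add_mod_p(a: int, b: int) -> int:
    """Final carry propagate addition modulo 2^31 - 1."""
    s = a + b
    return (s & P) + (s >> 31)


a, b, c = 0x7ABCDEF0, 0x12345678, 0x7FFFFFFF
carry, sum_ = csa(a, b, c)
print(add_mod_p(carry, sum_) % P == (a + b + c) % P)
```

Only the last step contains a carry chain, which is why several operands can be reduced cheaply before a single CPA.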
The disadvantage of this method is that the operand number of the CSA modulo $2^{31} - 1$ adder must be more than three, 
B. The Mixed Two-Stage Pipeline Architecture of ZUC
To achieve a high-throughput ZUC hardware implementation, the LFSR should be updated every clock cycle so that one 32-bit key word is produced per cycle at a high operating frequency. Although the long path is exercised in the initialization stage only, it lowers the operating frequency of the entire ZUC implementation. Sacrificing the throughput of the whole working stage for the sake of the short initialization stage is not worthwhile.
Based on this observation, we propose a new architecture in which the LFSR is updated every other clock cycle in the initialization stage and every clock cycle in the working stage. In this way we increase the operating frequency, and the key throughput is boosted because a 32-bit key word is still generated every clock cycle, as in the previously proposed architectures. In addition, the new architecture consumes fewer hardware resources. We next describe the new architecture in detail.
1) The initialization stage: As discussed above, in this stage the LFSR is updated every other clock cycle, which means that we can divide the original critical path into two sub-blocks, namely Pipeline Stage 1 and Pipeline Stage 2. Fig. 2 shows the structure of our architecture. As Fig. 2 shows, the data path in Pipeline Stage 2 is longer than that in Pipeline Stage 1, but it is still much shorter than the critical path of THREE-ZUC, about one fourth of it, because the modular adder used in THREE-ZUC adopts Method 1, while Method 2 is used in our architecture.
The update of the LFSR can be accelerated by pipelining, since $S_{15}$ is the only value required by Pipeline Stage 2 in the working stage, and the value of $S_{15}$ required in Pipeline Stage 1 can be obtained by pre-computation.
This revised method considerably boosts the throughput of the new design compared with all the previous works. The details of Pipeline Stage 1 and Pipeline Stage 2 can be found in Fig. 3 and Fig. 4, respectively. In Pipeline Stage 1, the first three CSA modular adders are used to calculate ($Carry_1$, $Sum_1$) using equation (5). The last CSA calculates ($Carry_2$, $Sum_2$) using equation (6).
The multiplexer in Fig. 3 plays an important role in changing the working mode. In the initialization stage, $u_1$ is strobed into the last CSA via the multiplexer, while in the working stage, 0 is strobed. In this way, in the working stage the multiplexer bypasses the circuit (marked with the dotted box in Fig. 3) that is specific to the initialization stage. Since the LFSR is updated every other clock cycle in the initialization stage, the expected value of $S_{16}$ required in the initialization stage is guaranteed.
At the end of Pipeline Stage 1, two extra 31-bit registers store the intermediate results $Carry_2$ and $Sum_2$ of the last CSA. We do not calculate the sum of $Carry_2$ and $Sum_2$ directly in Pipeline Stage 1, because doing so would lengthen the path in Pipeline Stage 1 by an extra 31-bit addition.
In Pipeline Stage 2, the CSA component is used to calculate ($Carry_3$, $Sum_3$) using equation (7). At the end of this stage, the modulo $2^{31} - 1$ adder of Method 2 is used to derive $S_{16}$, which is the value of $S_{15}$ in the next iteration. Once the new value of $S_{15}$ is computed, a checking step is needed to guarantee that it lies in the set $\{1, 2, \ldots, 2^{31} - 1\}$. Including this checking step in this stage would extend the path in Pipeline Stage 2. However, Zhang et al. [7] proved that this step can be omitted in the hardware implementation of ZUC.
2) The working stage: The operations in the first two clock cycles of the working stage differ from those in the later cycles, because the pipeline is filled during the first and second clock cycles. In the first clock cycle, Pipeline Stage 1 calculates ($Carry_1$, $Sum_1$) using equation (5), which is the major part of $S_{16}$ of the next iteration. At the end of the first clock cycle, all cells of the LFSR except the fifteenth shift right to update the LFSR state, while the content of the fifteenth cell remains unchanged.
In the second clock cycle, Pipeline Stage 2 starts running and calculates the value of $S_{16}$ using equation (7). At the same time, the major part of $S_{17}$, the required value of $S_{15}$ after two iterations, is calculated in Pipeline Stage 1 using equation (5). At the end of the second clock cycle, the value of $S_{16}$ is written simultaneously into the fifteenth and fourteenth cells of the LFSR, while the other cells of the LFSR shift right to update their contents. After these two clock cycles, the pipeline is filled and outputs a 32-bit key word each clock cycle.
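The arithmetic split between the two pipeline stages can be sketched in Python as below. Equations (5) to (7) are not reproduced in this section, so the assignment of operands to stages is an assumption: Stage 1 is taken to reduce the five feedback terms that do not depend on $S_{15}$ and then to add u (or 0), and Stage 2 is taken to add the $S_{15}$ term and perform the final addition. The feedback coefficients come from the public ZUC specification, all function names are hypothetical, and the sketch checks only the arithmetic, not the cycle-by-cycle timing or the register shifts.

```python
P = (1 << 31) - 1


def rotl31(x: int, k: int) -> int:
    """Multiplication by 2^k modulo 2^31 - 1 as a 31-bit rotation."""
    return ((x << k) | (x >> (31 - k))) & P


def csa(a: int, b: int, c: int) -> tuple:
    """Carry-save stage returning (Carry, Sum), both 31 bits wide."""
    return rotl31((a & b) | (a & c) | (b & c), 1), a ^ b ^ c


def add_mod_p(a: int, b: int) -> int:
    """Final modular adder (behaviour of Method 2)."""
    s = a + b
    return (s & P) + (s >> 31)


def stage1(s: list, u: int) -> tuple:
    """Assumed Stage 1: three CSAs give (Carry_1, Sum_1); a fourth adds u.
    In the working stage the multiplexer feeds u = 0."""
    t = [rotl31(s[13], 17), rotl31(s[10], 21), rotl31(s[4], 20),
         rotl31(s[0], 8), s[0]]
    c, x = csa(t[0], t[1], t[2])
    c, x = csa(c, x, t[3])
    carry1, sum1 = csa(c, x, t[4])
    return csa(carry1, sum1, u)              # (Carry_2, Sum_2), registered


def stage2(carry2: int, sum2: int, s15: int) -> int:
    """Assumed Stage 2: one CSA with the S_15 term, then one modular adder."""
    carry3, sum3 = csa(carry2, sum2, rotl31(s15, 15))
    return add_mod_p(carry3, sum3)           # new S_16
```

Under these assumptions each stage contains at most one carry propagate addition, and Stage 1 contains none.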
III. EVALUATION AND ANALYSIS

A. Evaluation result of the two-stage pipeline ZUC in FPGAs
To verify the correctness and evaluate the performance of our architecture on the FPGA platform, we implement the two-stage pipeline architecture in Verilog HDL and map it onto Virtex-5 XC5VLX110T-3 and Virtex-6 XC6VLX75T-3 FPGAs. The synthesis tool is ISE 11.5. The performance (in terms of throughput), the consumed area (in terms of Xilinx FPGA slices) and the ratio of throughput to area are given in Table 2.
As shown in Table 2, the new architecture increases the throughput by approximately 45% compared with the latest and best implementation [7], and it also saves nearly 12% of hardware resources.
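The comparison can be reproduced approximately from the rounded figures quoted in the text. The Python sketch below uses 7.9 Gbps with 350 slices for our Virtex-5 implementation and 5.5 Gbps with 395 slices for Zhang et al. [7]; the function names are hypothetical, and because the inputs are rounded the results only approximate the reported percentages.

```python
def relative_gain(new: float, old: float) -> float:
    """Fractional increase of new over old."""
    return (new - old) / old


def relative_saving(new: float, old: float) -> float:
    """Fractional reduction of new relative to old."""
    return (old - new) / old


def mbps_per_slice(throughput_gbps: float, slices: int) -> float:
    """Throughput-to-area ratio in Mbps per slice."""
    return throughput_gbps * 1000 / slices


ours_gbps, ours_slices = 7.9, 350      # Virtex-5 figures quoted in the text
ref_gbps, ref_slices = 5.5, 395        # Zhang et al. [7], Virtex-5

print(f"throughput gain: {100 * relative_gain(ours_gbps, ref_gbps):.1f}%")
print(f"area saving:     {100 * relative_saving(ours_slices, ref_slices):.1f}%")
print(f"ratio, ours:     {mbps_per_slice(ours_gbps, ours_slices):.1f} Mbps/slice")
print(f"ratio, [7]:      {mbps_per_slice(ref_gbps, ref_slices):.1f} Mbps/slice")
```

With these rounded inputs the gain comes out slightly below 45% and the saving slightly below 12%, consistent with the approximate figures stated above.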
B. Comparison with Existing Designs in ASIC
1) Comparison with Academic Designs in ASIC:
To compare with the existing ASIC designs, gate-level synthesis was carried out using Synopsys Design Compiler Version G-2012.06-SP5 in topographical mode for the 65 nm technology. So far, the only hardware realizations of ZUC in ASIC are those in [2]. To compare with the commercial designs, the authors of [2] used the best-performance library in 65 nm technology for the sake of fairness. Therefore, in our implementation, we also use the best-performance library of the 65 nm target technology. The area results are reported in equivalent 2-input NAND gates.
As shown in Table 3, our new architecture improves the throughput significantly compared with the existing designs. Since THREE-ZUC did not give detailed information on the library used, we give the results synthesized with two different TSMC 65 nm target libraries in Table 3. The rough analysis below shows why our design can reach such high performance.
Table 2 (recovered rows): FPGA implementations of ZUC
Design | Device | Frequency (MHz) | Area | Throughput | Throughput/Area (Mbps/slice)
[5] | Virtex-5 XC5VLX110T-3 | 108 | 356 slices | 3.4 Gbps | 9.7
Zhang et al. [7] | Virtex-5 XC5VLX110T-3 | 172 | 395 slices | 5.5 Gbps | 13.9
For simplicity, we define the delay of a full adder as one unit of time, denoted Δt. We assume that the adders follow the CPA architecture and ignore the routing delay. The delay of a 31-bit CPA is then about 31Δt, the delay of one CSA stage is about Δt, the delay of a multiplexer is about Δt, and the minimum delay of the S-box is about 8Δt. So the critical path of our architecture is in the nonlinear function.
According to the above definitions and the comparison result in Table 1, the total delay of the path in the nonlinear function F is about 41Δt = 32Δt + 8Δt + 1Δt, assuming that an XOR takes 1Δt. The total delay of the critical path in our work is about 44Δt = 31Δt + 8Δt + 5Δt. The total delay of the critical path in THREE-ZUC is about 125Δt = 31Δt + 31Δt + 31Δt + 31Δt + Δt. From this point of view, designs following our architecture will perform better than THREE-ZUC on the ASIC platform. The delay of our critical path is about one third of that of THREE-ZUC, which means that the throughput of our architecture is approximately three times that of THREE-ZUC. This estimate applies when both our architecture and THREE-ZUC use CPA adders to save hardware resources, since a CLA uses more hardware resources than a CPA.
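The unit-delay estimate can be written out as a short Python calculation. The constants are the unit delays stated in the text; the variable names are hypothetical, and the 5Δt term is kept as a lump sum because its breakdown is not given here.

```python
DT = 1                  # delay of one full adder, the unit Δt
CPA31 = 31 * DT         # 31-bit carry propagate adder
SBOX = 8 * DT           # minimum S-box delay

# Critical path of the proposed design: 31Δt + 8Δt + 5Δt
ours = CPA31 + SBOX + 5 * DT

# Critical path of THREE-ZUC: four 31-bit CPAs plus Δt
three_zuc = 4 * CPA31 + DT

print(ours, three_zuc, round(three_zuc / ours, 2))
```

The ratio of the two path delays is a little under 3, which is the basis of the "approximately three times" throughput estimate for CPA-based adders.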
In fact, under the best-performance constraint, the synthesis tool uses the carry look-ahead adder (CLA) architecture to implement addition, so the delay of the 32-bit adder is much shorter than that of an adder with the CPA architecture. According to the synthesis results of our design with the TSMC-LP 65 nm library and the TSMC-GP 65 nm library, our design increases the throughput by 2 times and 2.7 times, respectively, compared with THREE-ZUC.
In terms of hardware resources, our architecture uses fewer of them. Unlike THREE-ZUC, our new pipeline architecture uses Method 3 to calculate the modulo $2^{31} - 1$ additions. In this way, the new architecture saves four 31-bit adders. Moreover, since our pipeline has only two stages, fewer hardware resources are needed to store the intermediate results. For these reasons our design occupies much less hardware area than THREE-ZUC.
2) Comparison with Commercial Designs in the ASIC:
In the commercial area, both IP Cores Inc. [6] and Elliptic Tech Inc. [10] provide ZUC IP cores in 65 nm ASIC technology, and neither of them discloses its architecture. As far as we know, the best ASIC implementation of ZUC is given by IP Cores Inc. This commercial ZUC IP core was released later than THREE-ZUC. IP Cores Inc. only claims that the best performance of its ZUC IP core can reach up to 40 Gbps in TSMC 65 nm technology.
IV. CONCLUSION
In conclusion, we proposed a two-stage pipeline architecture of the stream cipher ZUC in hardware. Compared with the previous works, the new architecture increases the throughput significantly and saves a considerable amount of hardware resources on both FPGAs and ASICs. As the commercial IP companies have not disclosed their designs, we hope that this architecture can serve as a standard for hardware implementations of ZUC.
V. ACKNOWLEDGMENT
Cunqing Ma is the contact author of this paper. The work is supported by a grant from the Strategy pilot Project of Chinese Academy of Science (No.Y2W0012203).
