Abstract: Cyclic Redundancy Check (CRC) is often employed in data storage and communications to detect errors. The 3GPP-LTE wireless communication standard attaches a 24-bit CRC to every turbo coded frame; the CRC can thus be exploited both to detect residual errors and to enable early stopping of the decoding iterations. The current state of the art lacks specific CRC implementations for this standard, and most current solutions adopt a fixed degree of parallelism, which is unsuitable for many turbo decoder architectures. This work proposes a variable parallelism circuit targeting the 3GPP-LTE/LTE-Advanced 24-bit CRC that can adapt to input data of different sizes. Low complexity is achieved through careful functional sharing among the various parallelisms: comparison with the state of the art shows comparable or superior speed and extremely low complexity.
I. INTRODUCTION
THE Cyclic Redundancy Check (CRC) is a common technique employed in a variety of fields to detect errors in sequences of bits [1]-[3]. Its straightforward implementation and high reliability have led to ample usage in memories and in wired and wireless communications, including the 3GPP Long Term Evolution (LTE) standard [4]. In particular, the 3GPP-LTE/LTE-Advanced wireless communication standard relies on turbo codes [5] for forward error correction. Coded frame sizes ranging from 40 bits to 6144 bits with a granularity of down to 8 bits are foreseen: the last 24 bits of each coded frame are the remainder of a 24-bit CRC performed on the remaining bits (known as CRC-24b) [4]. Since the turbo decoding algorithm is iterative, the CRC can be used as an early stopping criterion [6]. The CRC is performed on the frame after the error correction process and compared to the received remainder: if they match, the process is stopped; otherwise, further decoding iterations (each composed of two half iterations) are required. The state of the art is rich in turbo code decoder designs, and the 3GPP-LTE/LTE-Advanced standard is often considered in both actual applications and case studies. However, very few designs implement the CRC-based early stopping, and scant details on the implementation are given [7]-[9].
The problem of CRC computation has been analyzed in depth in the past. In addition to serial implementations, many solutions for parallel CRC computation have been proposed [10]-[17]. Research on flexible CRC implementations has been particularly prolific, with programmable circuits that can handle different CRC polynomials [10], [11] and flexible design methodologies that produce efficient dedicated circuits given a particular CRC polynomial [13]-[15]. In the vast majority of these solutions, the degree of parallelism is fixed at design time. This poses problems in 3GPP-LTE turbo decoders, since the data on which the CRC is performed are often scattered across various memories, whose width can differ from the CRC parallelism and whose usage within the decoder is uneven.
Manuscript received January 17, 2014; revised April 15, 2014; accepted June 23, 2014. Date of publication July 01, 2014; date of current version July 16, 2014. This work has been partially funded by the NEWCOM# project, and developed within its work package 2.3.2 "Tools for embedded hardware/software architectures." The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Joseph Cavallaro.
The authors are with the Department of Electronics and Telecommunications, Politecnico di Torino, 10129 Turin, Italy (e-mail: carlo.condo@polito.it; maurizio.martina@polito.it; gianluca.piccinini@polito.it; guido.masera@polito.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LSP.2014.2334393
This work proposes a CRC circuit targeting the CRC-24b of 3GPP-LTE that can adapt on-the-fly to the size of the incoming data. By observing how the CRC calculation changes with the parallelism, a flexible low-complexity solution has been devised: three implementations are proposed, each based on different design choices. The rest of the paper is organized as follows: Section II describes the devised CRC parallelization process, while Section III details the designed circuit and its modifications. Implementation results are presented and compared to the state of the art in Section IV and conclusions are drawn in Section V.
II. PARALLELIZATION OF THE CRC OPERATION
Let us recall the CRC-24b polynomial for coded frames as defined by the 3GPP-LTE standard:

$$g_{CRC24B}(D) = D^{24} + D^{23} + D^{6} + D^{5} + D + 1 \qquad (1)$$
In the CRC computation, the frame is sequentially divided by the generator polynomial: if at the end of the division the remainder is equal to the received CRC, the frame is considered correct and the decoding is stopped. Let $F$ be a frame made of $K$ bits and $R$ a vector of $L+1$ bits holding the current remainder value left-shifted by one position; $L$ is the remainder length and in this case $L = 24$. In the following, $F(i)$ and $R(i)$ refer to the bit in position $i$ of $F$ and $R$ respectively. The CRC polynomial division is performed in the binary domain as detailed in Alg. 1, where $\oplus$ is the binary XOR operation, $R'$ is the new remainder that is used to update $R$, and $G$ is the binary representation of $g_{CRC24B}(D)$, in which the 25th, 24th, 7th, 6th, 2nd and first bits are '1' and the others are '0'. The operations in Alg. 1 suppose that the frame bits $F(i)$, without the CRC bits, are shifted into $R$ one at a time: this operation would be very time consuming, especially for long frames. Assuming that one run of the CRC loop (lines 2-10 in Alg. 1) is performed in one clock cycle, we define $N_{cyc}$ as the number of cycles necessary to complete the computation. Depending on the decoder architecture, it is possible to perform the CRC computation every iteration or every half iteration. Considering the most common parallel turbo decoder architectures, a half iteration requires at least clock cycles, where is the number of decoder cores. Since the CRC computation might need to be completed in a half iteration, recent LTE/LTE-Advanced decoder implementations, that use or , would be forced to wait for the CRC completion. Indeed, even when , a serial CRC calculation could not be sustained. As an example, for and allowing up to twelve half iterations (i.e. six iterations, a conservative choice), to obtain a throughput of 450 Mb/s the decoder would need to run at a clock frequency of 5.3 GHz, which is not realistic.
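Algorithm 1: serial CRC polynomial division (listing not recoverable; its loop spans lines 2-10 and closes with "10: end for").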
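The bit-serial division can be illustrated with a minimal Python sketch. It is not the listing of Alg. 1: it assumes the common 24-bit LFSR formulation rather than the 25-bit left-shifted remainder used in the text, and the names `crc24b_serial`, `to_bits`, `POLY_LOW` and `MASK24` are hypothetical.

```python
# Illustrative bit-serial CRC-24b; an assumed LFSR formulation, not the paper's Alg. 1.
POLY_LOW = 0x800063   # D^23 + D^6 + D^5 + D + 1: the generator without its D^24 term
MASK24 = 0xFFFFFF


def crc24b_serial(bits):
    """Return the 24-bit remainder of bits(D) * D^24 divided by the generator."""
    r = 0
    for b in bits:                       # one frame bit per loop run ("clock cycle")
        feedback = ((r >> 23) & 1) ^ b   # remainder MSB XOR incoming bit
        r = (r << 1) & MASK24
        if feedback:
            r ^= POLY_LOW
    return r


def to_bits(value, width):
    """Hypothetical helper: MSB-first list of `width` bits of `value`."""
    return [(value >> i) & 1 for i in reversed(range(width))]


payload = to_bits(0xC0FFEE, 40)          # 40 arbitrary payload bits
crc = crc24b_serial(payload)
frame = payload + to_bits(crc, 24)       # payload followed by its CRC
print(crc24b_serial(frame) == 0)         # a correct frame leaves a zero remainder
```

In an early-stopping scheme, the same check would be run on the hard decisions after a half iteration, and decoding would stop once the remainder over the whole frame is zero.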
A common approach to speed up the CRC is the parallelization of the polynomial division [13] by unfolding the operation described in Alg. 1, meaning that $P$ bits of the frame are loaded at each cycle in the shift register containing the remainder, which reduces the number of cycles by a factor of $P$. The inherent degree of parallelism in CRC-24b is equal to the remainder length, that is $P = 24$. However, 24 is not always an implementation-friendly degree of parallelism. Indeed, the parallel computation requires a 24-bit vector containing part of the hard decision vector, but in 3GPP-LTE/LTE-Advanced the frame size has a granularity of 8 bits. For this reason, data are usually stored in memories that have a width of one, two or four bytes, and assembling 24-bit vectors is not straightforward. Moreover, there is no guarantee that all windows are composed of the same number of bits: border windows in parallel decoders can be truncated. The proposed low-cost CRC circuit has a variable degree of parallelism that can adapt on-the-fly to the size of the incoming data and is suitable for diverse decoder structures. Table I reports the equations necessary for the computation of the updated remainder vector, bit by bit, for three different degrees of parallelism (8, 16 and 24). They have been obtained by unfolding the serial operation described in Alg. 1. The 24-bit input vector contains the hard decision bits: since their number varies according to the degree of parallelism, the useful bits are the 8, 16 or 24 most significant bits. In the following, when square brackets are used, the XOR operation must be applied among the values in the indicated interval, that is, for a generic vector $X$ and positions $a \le b$,

$$X[a:b] = X(a) \oplus X(a+1) \oplus \cdots \oplus X(b) \qquad (2)$$

It has been noticed that many of the XOR operations required by each degree of parallelism are contained in those required for the higher degrees of parallelism. Most of the superpositions between the different parallelisms are explicitly shown in Table I, where $R'_{8}(i)$ and $R'_{16}(i)$ are the $i$-th bit of the remainder as computed with a CRC parallelism of 8 (P-8) and 16 (P-16) respectively. The table entries highlighted in light gray are those equations that are completely included in the next higher degree of parallelism, whereas dark gray shading means a partial inclusion. These inclusions allow for a flexible implementation of a variable parallelism CRC at very low cost, as detailed in the next sections.
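The unfolding step can be sketched in Python by applying the serial update symbolically $P$ times, so that every bit of the updated remainder becomes an XOR of current remainder bits and incoming data bits. This is only an illustration under assumptions: it uses a 24-bit LFSR formulation, so the resulting equations need not match Table I term by term, and `serial_step`, `unfold` and `parallel_step` are hypothetical names.

```python
import random

POLY_LOW = 0x800063   # generator D^24+D^23+D^6+D^5+D+1 without the D^24 term
MASK24 = 0xFFFFFF


def serial_step(r, bit):
    """One bit-serial LFSR step (assumed formulation, not the paper's Alg. 1)."""
    feedback = ((r >> 23) & 1) ^ bit
    r = (r << 1) & MASK24
    return r ^ POLY_LOW if feedback else r


def unfold(p):
    """Symbolically apply p serial steps.

    Returns 24 frozensets; entry i holds the symbols whose XOR gives bit i of
    the updated remainder. ("r", j) is current remainder bit j and ("d", k)
    is the k-th data bit shifted in (k = 0 enters first).
    """
    state = [frozenset([("r", j)]) for j in range(24)]
    for k in range(p):
        # XOR over GF(2) is the symmetric difference of the symbol sets.
        feedback = state[23] ^ frozenset([("d", k)])
        shifted = [frozenset()] + state[:23]
        state = [shifted[j] ^ feedback if (POLY_LOW >> j) & 1 else shifted[j]
                 for j in range(24)]
    return state


def parallel_step(r, data, equations):
    """Evaluate unfolded equations on a remainder r and len(data) new bits."""
    out = 0
    for i, terms in enumerate(equations):
        bit = 0
        for kind, idx in terms:
            bit ^= ((r >> idx) & 1) if kind == "r" else data[idx]
        out |= bit << i
    return out


random.seed(0)
for p in (8, 16, 24):
    eqs = unfold(p)
    r0 = random.getrandbits(24)
    data = [random.getrandbits(1) for _ in range(p)]
    expected = r0
    for b in data:
        expected = serial_step(expected, b)
    # One parallel update must equal p serial updates.
    print(p, parallel_step(r0, data, eqs) == expected)
```

Comparing the symbol sets returned by `unfold(8)`, `unfold(16)` and `unfold(24)` is one way to look for sub-expressions shared between parallelisms.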
III. VARIABLE PARALLELISM CRC CIRCUIT
The proposed variable parallelism CRC system is pictured in Fig. 1. At every clock cycle, a vector of $P$ new data bits is injected in the system and stored in an input register. Although the register is 24 bits long, $P$ can be 8, 16 or 24 according to the selected parallelism. In case of P-8 or P-16, the data are stored in the 8 or 16 most significant bits of the register. Together with the register content, the current remainder is given to the XOR network, where the equations shown in Table I are implemented. Three possible remainders are obtained, and the correct one is selected according to the current parallelism selection signal. As the input data change at every clock cycle, the degree of parallelism can change as well: for example, let us assume a two-core decoder architecture that stores hard decision bits in 16-bit registers. When considering 80 bits, each core fills two 16-bit registers, while only 8 bits of a third register are used. Consequently, after two P-16 operations, a single P-8 is required, immediately followed by the first P-16 of the second core.
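The per-cycle change of parallelism in the two-core example can be mimicked with a small behavioural model in Python. This is a sketch under assumptions: it models only the input/output behaviour of one clock cycle (not the XOR network of Fig. 1), it assumes 40 bits per core, and `crc_update`, `pack_msb` and the bit pattern in `frame` are hypothetical.

```python
POLY_LOW = 0x800063   # generator D^24+D^23+D^6+D^5+D+1 without the D^24 term
MASK24 = 0xFFFFFF


def crc_update(r, reg24, p):
    """Behavioural model of one clock cycle: consume the p MSBs of a 24-bit register."""
    for i in range(p):
        bit = (reg24 >> (23 - i)) & 1
        feedback = ((r >> 23) & 1) ^ bit
        r = (r << 1) & MASK24
        if feedback:
            r ^= POLY_LOW
    return r


def pack_msb(bits):
    """Left-align up to 24 bits in a 24-bit word (unused LSBs stay at zero)."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word << (24 - len(bits))


frame = [((0x9A3C5 * (i + 1)) >> 3) & 1 for i in range(80)]  # arbitrary 80-bit pattern
schedule = [16, 16, 8, 16, 16, 8]   # core 1: P-16, P-16, P-8, then the same for core 2

r, pos = 0, 0
for p in schedule:                   # the selected parallelism may change every cycle
    r = crc_update(r, pack_msb(frame[pos:pos + p]), p)
    pos += p

reference = 0
for b in frame:                      # bit-serial reference over the same 80 bits
    reference = crc_update(reference, b << 23, 1)

print(r == reference, len(schedule), "cycles instead of", len(frame))
```

The mixed schedule reaches the same remainder as the serial computation in six update cycles, which is the behaviour the variable parallelism circuit is meant to provide.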
A. Low Complexity XOR Network
The XOR network is structured as a combinational network of XOR gates that performs the operations needed by the equations in Table I: operations among multiple bits within the same equation are handled via a binary tree of XOR gates. This structure yields a critical path equal to the deepest tree in the system. Five-level deep XOR trees are present in both P-24 and P-16, while in P-8 only four-level XOR trees are found. Consequently, the critical path is equal to $5 T_{XOR}$, where $T_{XOR}$ is the delay introduced by a two-input XOR gate. To obtain a minimum complexity XOR network for the P-8, P-16 and P-24 CRC and at the same time estimate the cost and gain of the proposed variable parallelism architecture, successive optimizations are applied to the remainder computation. These optimizations can be difficult to perform for automated tools [18].
• Unoptimized parallel CRC: the starting point of this analysis is the number of XOR gates required for the operations in Table I without considering the superpositions. The P-8 CRC requires 38 XOR gates, P-16 78 and P-24 105, for a total of 221 XOR gates.
• Intra-parallelism optimized CRC: some of the XOR gates counted in the previous analysis are repetitions of other XOR gates within the same parallelism. By reusing them, P-8 CRC is reduced to 24 XOR gates, P-16 needs only 47, and 73 gates are sufficient for P-24, amounting to a total of 144 XOR gates.
• Inter-parallelism optimized CRC: by sharing XOR gates between P-8, P-16 and P-24 according to Table I, it is possible to substantially reduce the circuit complexity. In particular, assuming P-8 is implemented with 24 XOR gates, only 31 additional gates are needed for P-16. With both P-8 and P-16 present, P-24 can be implemented with 37 additional XOR gates, bringing the total gate count down to 92. As can be observed, starting from an internally optimized P-24 (73 XOR gates), two degrees of parallelism can be added to the CRC at the cost of 19 additional XOR gates (a 26% increase).
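The gate counts quoted in the three optimization steps can be cross-checked with a short worked computation. The symbols $N_{\text{unopt}}$, $N_{\text{intra}}$ and $N_{\text{inter}}$ are introduced here only for illustration and do not appear in the original text.

```latex
\begin{align*}
N_{\text{unopt}} &= 38 + 78 + 105 = 221 \\
N_{\text{intra}} &= 24 + 47 + 73 = 144 \\
N_{\text{inter}} &= 24 + 31 + 37 = 92 \\
N_{\text{inter}} - 73 &= 19, \qquad \frac{19}{73} \approx 0.26 \\
\frac{N_{\text{inter}}}{N_{\text{unopt}}} &= \frac{92}{221} \approx 0.42
\end{align*}
```

Under these figures, the shared network needs roughly 42% of the gates of the unoptimized implementation of the three parallelisms.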
B. Pipelined XOR Network
To shorten the critical path and increase the achievable frequency, the XOR network can be pipelined. Four pipeline stages have been added to the previous architecture, separating the levels of all binary XOR trees and propagating signals accordingly, and requiring 154 delay elements in total. They introduce a latency of four clock cycles, but reduce the critical path from five two-input XOR gate delays ($5 T_{XOR}$) to a single one ($T_{XOR}$).
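The relation between the number of terms in an equation, the depth of its binary XOR tree and the number of pipeline register stages can be sketched as follows. The function names are hypothetical, and the sketch assumes balanced trees of two-input gates with one register stage between consecutive levels.

```python
def xor_tree_levels(n_inputs):
    """Depth of a balanced binary tree of two-input XOR gates with n_inputs leaves."""
    if n_inputs < 1:
        raise ValueError("at least one input is required")
    return (n_inputs - 1).bit_length()   # integer form of ceil(log2(n_inputs))


def pipeline_stages(levels):
    """Register stages needed to separate every pair of consecutive tree levels."""
    return max(levels - 1, 0)


for n in (16, 17, 32, 33):
    levels = xor_tree_levels(n)
    print(n, "inputs:", levels, "levels,", pipeline_stages(levels), "register stages")
```

With this model, any equation with 17 to 32 terms needs a five-level tree, and separating five levels takes four register stages, in line with the four cycles of latency mentioned above.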
C. Extension to 32-bit Parallelism
Even though the inherent degree of parallelism of CRC-24b is 24, 32-bit registers are common enough in turbo decoders that a 32-bit parallelism (P-32) can be useful. Implementing a degree of parallelism higher than the CRC parallelism, however, comes at a higher complexity cost than the previous ones. Apart from the additional bits in the input vector, the equations for P-32 are much more complex than those presented in Table I, and the degree of superposition among P-32 and the smaller parallelisms is reduced. Indeed, with P-8, P-16 and P-24 already present (92 XORs), 54 additional XOR gates are necessary to also implement P-32, for a total of 144 XOR gates. The critical path becomes , but with P-32 only 192 clock cycles are necessary to perform the remainder calculation on the largest turbo code frame size in 3GPP-LTE.
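The cycle count for each parallelism can be illustrated with a short Python sketch. It assumes that one update consumes $P$ bits per clock cycle and that the last, possibly partial, word still takes a full cycle; `crc_cycles` is a hypothetical name.

```python
def crc_cycles(frame_bits, parallelism):
    """Clock cycles needed when `parallelism` bits are consumed per cycle."""
    return -(-frame_bits // parallelism)   # ceiling division on integers


LARGEST_FRAME = 6144   # largest turbo code frame size in 3GPP-LTE, in bits

for p in (1, 8, 16, 24, 32):
    print(f"P-{p}: {crc_cycles(LARGEST_FRAME, p)} cycles")
```

For the 6144-bit frame this gives 768, 384, 256 and 192 cycles for P-8, P-16, P-24 and P-32 respectively, against 6144 cycles for the serial case.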
IV. IMPLEMENTATION
The proposed circuits have been synthesized with Synopsys Design Compiler on a 90 nm CMOS standard cell technology. The smallest area obtained implementing the CRC circuit with P-8, P-16 and P-24 is $\mu$m$^2$, with a target maximum frequency of 1 GHz (critical path of 1 ns), meaning that the correct functionality is guaranteed for frequencies up to 1 GHz. The vast majority of decoders have a much lower working frequency [8], [9]: it is consequently improbable that the system critical path resides in the variable parallelism CRC circuit. However, relaxing the area constraint (and thus disrupting the binary tree structure), maximum frequencies as high as 2.5 GHz (critical path of 0.4 ns) can be obtained with an area of $\mu$m$^2$. Another implementation has been carried out taking into account the pipelined XOR network. The maximum achievable frequency is 5.26 GHz (critical path of 0.19 ns), obtained with an area occupation of $\mu$m$^2$: as expected, it is roughly five times faster than when enforcing the binary tree structure in the combinational XOR network. Given the very high frequencies obtained, in most implementations the pipelined architecture would be unnecessary, bringing an overhead of four clock cycles of latency and no actual benefit in speed. A final implementation also takes into account the fourth degree of parallelism, P-32: maintaining the tree structure in the XOR network, an area of $\mu$m$^2$ and a maximum frequency of 833 MHz have been obtained. Table II presents the characteristics of the proposed CRC together with other solutions present in the literature: the combinational circuit is labeled Prop-C, the pipelined architecture is Prop-P and the circuit with P-32 is Prop-32. The number of XOR gates needed for each implementation is listed alongside the number of delay elements, the critical path and the number of cycles needed to perform the CRC calculation on bits.
Since the proposed circuits concurrently implement different degrees of parallelism that can be selected on-the-fly, the number of cycles depends on the chosen mode (P-8, P-16, P-24 or P-32). To the best of our knowledge, no detailed CRC circuit implementations for 3GPP-LTE exist in the state of the art. Comparison is consequently attempted with other parallel CRC circuits targeting similar polynomials: in particular, the 16-bit CRC-16 and 32-bit CRC-32 are taken into account. CRC-16 and CRC-24b share similar polynomial structures, with a large number of zero coefficients between the 2nd and 3rd highest order nonzero coefficients, while CRC-32 makes it possible to evaluate how circuit complexity rises with the order of the polynomial and is closer to Prop-32. The CRC solution employed in [9] is not included in Table II: the lack of implementation details prevents a fair comparison, but a qualitative estimation is possible. A bit look-up table is necessary for every decoding core to implement the distributed part of the CRC, while the recombination circuit requires a tree of adders that widens and deepens with the increasing of , leading to a fast but costly implementation. The circuit designed in [13] starts from the sequential Linear Feedback Shift Register (LFSR) implementation and applies unfolding, pipelining and retiming to reduce the critical path and increase the parallelism. Our approach is different in concept: starting from the desired parallelisms, we have devised a minimum-complexity circuit sharing as many logic functions as possible. The architecture in [13] relies on a four-level pipeline: it requires a total of 28 delay elements, performing the CRC-16 calculation on bits in cycles, and has a of . The proposed architecture, on the contrary, requires 24 delay elements, and achieves without the need for pipelining. Supposing that only P-16 is used, bits are handled in clock cycles.
Regardless of the additional degrees of parallelism (P-8 and P-24) and of the more complex polynomial, the proposed architecture requires only 92 XOR gates, against the 137 of [13]. The LFSR-based parallel CRC-16 circuits presented in [14] and [15] rely on a number of XOR gates comparable to both Prop-C and Prop-P, and they need fewer delay elements. However, they require a larger number of cycles to complete the computations, and [15] has a very large critical path. All three approaches [13]-[15] experience a steep increase in complexity w.r.t. CRC-16 when implementing CRC-32. Even though the calculations involved in CRC-32 are comparable in complexity to those involved in P-32 for CRC-24b, Prop-32 requires less than 1/3 of the XOR gates of [13], [14] and less than 1/4 of those of [15]. This achievement is even more significant considering that the effectiveness of [13]-[15] is reduced in the presence of CRC polynomials whose order is not a power of two.
V. CONCLUSION
In this work, a novel circuit for the parallel computation of CRC-24b employed in the 3GPP-LTE/LTE-Advanced standard has been proposed. It is able to perform the CRC calculation with a variable degree of parallelism, that can be changed on-the-fly. Three versions of the circuit have been designed: a low-complexity circuit supporting three degrees of parallelism, a fast pipelined version, and an extended design supporting four different parallelisms. Implemented in CMOS 90 nm technology, they all show very good complexity and speed figures: comparison with the state of the art reveals unmatched on-the-fly adaptivity at an extremely low complexity cost and comparable or superior speed.
