Abstract-We present a fast cyclic redundancy check (CRC) algorithm that performs CRC computation for any length of message in parallel. Traditional CRC implementations have feedbacks, which make pipelining problematic. In the proposed approach, we eliminate feedbacks and implement a pipelined calculation of 32-bit CRC in the SMIC 0.13 µm CMOS technology. For a given message, the algorithm first chunks the message into blocks, each of which has a fixed size equal to the degree of the generator polynomial. Then it computes CRC for the chunked blocks in parallel using lookup tables, and the results are combined together by performing XOR operations. The simulation results show that our proposed pipelined CRC is more efficient than existing CRC implementations.
I. INTRODUCTION
Communication networks use protocols with ever increasing demands on speed. Meeting the speed requirement is crucial because packets will be dropped if the jobs are not completed at wire speed. High-throughput protocols such as IEEE 802.11n WLAN and UWB (Ultra Wide Band) have recently emerged, and new protocols with much higher throughput requirements are on the way. To support CRC (Cyclic Redundancy Check) calculation for these high-throughput standards at a reasonable clock frequency, it is desirable to process multiple bits in parallel and to pipeline the processing path. Although parallel CRC algorithms have been proposed in recent years, they lengthen the worst-case timing path and increase the required area and power consumption. Therefore, we seek an alternative way to implement CRC hardware that speeds up the CRC calculation with reasonable area and power consumption.
II. PROPOSED CRC ALGORITHM
For our parallel CRC design, the ordinary serial computation should be rearranged into a parallel configuration. We use the following two theorems to achieve parallelism in CRC computation.
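Theorem 1: Let $A(x) = A_1(x) + A_2(x) + \cdots + A_N(x)$ over GF(2). Given a generator polynomial $P(x)$, $\mathrm{CRC}[A(x)] = \sum_{i=1}^{N} \mathrm{CRC}[A_i(x)]$.
Theorem 2: Given a polynomial $B(x)$ over GF(2), $\mathrm{CRC}[x^k B(x)] = \mathrm{CRC}[x^k\,\mathrm{CRC}[B(x)]]$ for any k.
Both theorems can be easily proved using the properties of GF(2).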
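The two properties used here, linearity of the CRC over GF(2) and the freedom to reduce a polynomial before multiplying it by $x^k$, can be spot-checked numerically. The following is a minimal sketch, assuming that CRC[A(x)] denotes the plain remainder of A(x) modulo the generator (no initial value, bit reflection, or final XOR) and assuming the CRC-32 generator polynomial; the names `poly_mod` and `check_theorems` are hypothetical helpers, not part of the paper.

```python
import random

# Assumption: the CRC-32 generator polynomial (degree 32), stored as an
# integer whose bit i is the coefficient of x^i.
POLY = 0x104C11DB7


def poly_mod(a: int, p: int = POLY) -> int:
    """Return a(x) mod p(x) over GF(2); this plays the role of CRC[a(x)]."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # Cancel the leading term of a(x) with a shifted copy of p(x).
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def check_theorems(trials: int = 200, seed: int = 0) -> bool:
    """Spot-check both identities on random polynomials."""
    rng = random.Random(seed)
    for _ in range(trials):
        a1, a2, b = (rng.getrandbits(128) for _ in range(3))
        k = rng.randrange(200)
        # Theorem 1 (two summands): the remainder is linear over GF(2).
        if poly_mod(a1 ^ a2) != poly_mod(a1) ^ poly_mod(a2):
            return False
        # Theorem 2: reducing B(x) before multiplying by x^k changes nothing.
        if poly_mod(b << k) != poly_mod(poly_mod(b) << k):
            return False
    return True
```

A random check of this kind is not a proof, but it is a quick way to confirm that a software model follows the same CRC definition as the theorems.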
Theorem 1 implies that we can split a message $A(x)$ into multiple blocks, $A_1(x), A_2(x), \ldots, A_N(x)$, and compute each block's CRC independently. For example, suppose that $A(x) = a_{l-1}x^{l-1} + a_{l-2}x^{l-2} + \cdots + a_0$ represents an l-bit message, and that it is split into b-bit blocks. For simplicity, let's assume that l is a multiple of b, namely l = Nb. Then the ith block of the message is the b-bit binary string from bit (i − 1)b + 1 to bit ib of the original message, followed by l − ib zeros, and thus we get
$A_i(x) = \sum_{j=l-ib}^{l-(i-1)b-1} a_j x^j$.
$A_i(x)$ has many trailing zeros, and their number determines the degree of $A_i(x)$. This means that the length of the message must be known beforehand to form the correct $A_i(x)$. Thanks to Theorem 2, however, we can compute the CRC of the prefix of $A_i(x)$ that contains the b-bit substring of the original message, and then update the CRC later when the number of following zeros is known.
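The block decomposition can be illustrated in software. The sketch below assumes the CRC is the plain remainder modulo the CRC-32 generator and represents the message as an integer whose most significant bit is the first message bit; `crc_by_blocks` is a hypothetical name. Each b-bit block is reduced on its own, and the trailing zeros are accounted for afterwards.

```python
# Assumption: CRC-32 generator polynomial; bit i is the coefficient of x^i.
POLY = 0x104C11DB7


def poly_mod(a: int, p: int = POLY) -> int:
    """Return a(x) mod p(x) over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def crc_by_blocks(msg: int, l: int, b: int) -> int:
    """XOR together the CRCs of the N = l / b blocks of an l-bit message."""
    assert l % b == 0, "sketch assumes l is a multiple of b"
    mask = (1 << b) - 1
    total = 0
    for i in range(1, l // b + 1):
        zeros = l - i * b                  # trailing zeros of the ith block
        chunk = (msg >> zeros) & mask      # the b message bits of the ith block
        prefix_crc = poly_mod(chunk)       # CRC of the prefix, zeros not yet known
        total ^= poly_mod(prefix_crc << zeros)  # later update for the zeros
    return total
```

Because every block is reduced independently of the others, the loop body could run in parallel; only the final XOR joins the results.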
The first step of our algorithm is to divide a message into a series of n-byte blocks. A single block becomes the basic unit of parallel processing. To keep the description simple, we assume that the message length is a multiple of n so that every block has exactly n bytes. We will later address the case where the last block is shorter than n bytes.
In our implementation, we perform four CRC calculations simultaneously. $A_1(x)$ becomes the first block followed by 3n bytes of zeros, $A_2(x)$ the second block followed by 2n bytes of zeros, $A_3(x)$ the third block followed by n bytes of zeros, and $A_4(x)$ the fourth block itself. Combining the four CRCs using XOR yields the CRC for the first 4n bytes of the message. This step of processing the first four blocks and getting the CRC is the first iteration; let's call the result $\mathrm{CRC}_1$. In the second iteration, the next four blocks are processed to produce their combined CRC. Note that this result combined with $\mathrm{CRC}[x^{4\cdot 8\cdot n}\,\mathrm{CRC}_1]$ is the CRC for the first eight blocks because of Theorem 2.
The algorithm is implemented as shown in Fig. 1. It has five blocks as input; four of them are used to read four new blocks from the message in each iteration. They are converted into CRCs using lookup tables: LUT3, LUT2, and LUT1. LUT3 contains CRC values for the input followed by 12 bytes of zeros, LUT2 8 bytes, and LUT1 4 bytes. The results are combined using XOR and then combined with the output of LUT4, the CRC of the value from the previous iteration with 16 bytes of zeros appended. In order to reduce the critical path, we introduce another stage, called the pre-XOR stage, right before the four-input XOR gate. This makes the algorithm more scalable because more blocks can be added without increasing the critical path of the pipeline. With the pre-XOR stage, the critical path is the delay of LUT4 and a two-input XOR gate, and the throughput is 16 bytes per cycle.
The architecture in Fig. 1 shows a further optimization involving the leftmost 4-byte block. Since the CRC of the first block is the first block itself, it can be easily combined with the following four blocks by appending 16 bytes of zeros using LUT4. To exploit this, the first iteration loads the first five blocks from the message. The multiplexer is set to choose the leftmost block.
In this way, the CRC for five blocks is calculated in the first iteration. Two registers below the leftmost block are needed to delay it for two clock cycles while the other blocks' CRCs are combined. From the second iteration on, four blocks from the message are loaded, and the multiplexer chooses the result from the previous iteration.
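The iteration scheme can be modeled in software as follows. This is only a behavioral sketch under several assumptions: 4-byte blocks, a CRC defined as the plain remainder modulo the CRC-32 generator, a message of 1 + 4m blocks, and no modeling of pipeline registers or timing. The function `lut` is a hypothetical stand-in that computes what LUT1 to LUT4 would return instead of reading a stored table.

```python
# Assumption: CRC-32 generator polynomial; bit i is the coefficient of x^i.
POLY = 0x104C11DB7


def poly_mod(a: int, p: int = POLY) -> int:
    """Return a(x) mod p(x) over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def lut(k: int, value: int) -> int:
    """Stand-in for LUTk: CRC of a 32-bit value followed by 4*k zero bytes."""
    return poly_mod(value << (32 * k))


def pipelined_crc(blocks: list) -> int:
    """CRC of a message given as 32-bit blocks; len(blocks) must be 1 + 4*m."""
    assert len(blocks) % 4 == 1, "sketch handles 1 + 4*m blocks only"
    # First iteration: the multiplexer selects the leftmost block, which is
    # its own CRC because its degree is below 32.
    carry = blocks[0]
    for i in range(1, len(blocks), 4):
        b1, b2, b3, b4 = blocks[i:i + 4]
        # CRCs of the four new blocks, each with its own run of zeros.
        fresh = lut(3, b1) ^ lut(2, b2) ^ lut(1, b3) ^ b4
        # LUT4 appends 16 zero bytes to the previous result (Theorem 2).
        carry = lut(4, carry) ^ fresh
    return carry
```

For example, nine blocks are consumed in two iterations: five in the first and four in the second.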
We need to calculate the CRC of every block followed by zeros, whose length varies from 4 to 16 bytes. For faster calculation, our algorithm employs lookup tables that contain pre-calculated CRCs; four lookup tables are needed. The key point is that although the input may be as long as 20 bytes, only the first 4 bytes need actual calculation because of Theorem 1. When implemented using a single lookup table, however, the CRC for 4 bytes still requires as many as $2^{32}$ entries in the table. Instead, each lookup table should pre-compute the CRC of each byte followed by a different number of zeros. For the kth block in one iteration, 4k additional zero bytes should be appended, where k = 0, 1, 2, 3. The outputs of the lookups should then be combined using XOR.
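Indexing tables by single bytes rather than by whole 4-byte blocks can be sketched as below. The sketch assumes the CRC is the plain remainder modulo the CRC-32 generator; `byte_table` and `sliced_lookup` are hypothetical names. Each table has 256 entries and holds the CRC of one byte followed by a fixed number of zero bytes, so a 4-byte block needs four such tables (1024 entries in total) instead of one table with $2^{32}$ entries.

```python
from functools import lru_cache

# Assumption: CRC-32 generator polynomial; bit i is the coefficient of x^i.
POLY = 0x104C11DB7


def poly_mod(a: int, p: int = POLY) -> int:
    """Return a(x) mod p(x) over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


@lru_cache(maxsize=None)
def byte_table(zero_bytes: int) -> tuple:
    """256-entry table: CRC of each byte value followed by zero_bytes zero bytes."""
    return tuple(poly_mod(v << (8 * zero_bytes)) for v in range(256))


def sliced_lookup(block: int, trailing_zero_bytes: int) -> int:
    """CRC of a 32-bit block followed by trailing_zero_bytes zero bytes."""
    result = 0
    for pos in range(4):  # pos 0 is the most significant byte of the block
        byte = (block >> (8 * (3 - pos))) & 0xFF
        # Zeros after this byte: the rest of the block plus the appended zeros.
        zeros = (3 - pos) + trailing_zero_bytes
        result ^= byte_table(zeros)[byte]
    return result
```

Calling `sliced_lookup(block, 12)` corresponds to what LUT3 is described as returning, `sliced_lookup(block, 8)` to LUT2, and so on.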
When the message size is not a multiple of 4n, the number of bytes processed in each iteration, the last iteration receives fewer than 4n bytes and must be handled differently. To implement this, we introduce four multiplexers in Fig. 2.
Since the last block size is between 1 and 16, it can be encoded with four bits, 0000 being size 1, 0001 being 2, and so on. Let this 4-bit encoding be w = w 3 w 2 w 1 w 0 . In the last iteration, 16 bytes consisting of the last block preceded by zeros are loaded. Then the multiplexers select the value from the previous iteration depending on w.
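One way to read the last-iteration step is that the previous result must be advanced by only as many bytes as the last block actually holds, rather than by the full 16. The sketch below models that reading in software; it is an interpretation, not the multiplexer circuit itself, and it assumes the CRC is the plain remainder modulo the CRC-32 generator. The function name `finish_with_partial` is hypothetical.

```python
# Assumption: CRC-32 generator polynomial; bit i is the coefficient of x^i.
POLY = 0x104C11DB7


def poly_mod(a: int, p: int = POLY) -> int:
    """Return a(x) mod p(x) over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def finish_with_partial(prev_crc: int, last_bytes: bytes) -> int:
    """Combine the previous CRC with a final block of 1 to 16 bytes."""
    size = len(last_bytes)
    assert 1 <= size <= 16
    w = size - 1  # 4-bit size code: 0000 means 1 byte, 1111 means 16 bytes
    # Load 16 bytes: the last block preceded by zeros. Leading zeros do not
    # change the polynomial, so its CRC equals the CRC of the block alone.
    padded = int.from_bytes(last_bytes.rjust(16, b"\x00"), "big")
    fresh = poly_mod(padded)  # in hardware this would come from the lookup tables
    # Advance the previous result by (w + 1) bytes instead of 16 (Theorem 2).
    shifted_prev = poly_mod(prev_crc << (8 * (w + 1)))
    return shifted_prev ^ fresh
```

With a full 16-byte last block (w = 1111), this reduces to the regular iteration.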
III. EVALUATION
A Verilog implementation of the proposed algorithm has been created for CRC32. We compared our algorithm with an ordinary CRC algorithm [1] and a pipelined CRC algorithm [2] for an input size of 64 bits. Each of the algorithms was implemented with a 1.2 V power supply using the SMIC 0.13 µm CMOS technology. The synthesis results are shown in Table I. As expected, pipelined and parallel approaches increase throughput at the cost of area. Still, our algorithm achieves 60% more throughput than the previous pipelined algorithm while occupying less area. Another advantage over the compared pipelined algorithm is that our algorithm does not require LFSR logic or inversion operations.
IV. CONCLUSION
We presented a fast CRC algorithm that uses lookup tables and implements a pipelined architecture to perform CRC computation for any length of message in parallel. Given more space, it achieves considerably higher throughput than existing CRC algorithms. Its throughput is also higher than the previous pipelined approach while consuming less space. With little delay increase in the critical path, the throughput can be improved further by increasing parallelism.
