This paper presents a theoretical result in the context of realizing high-speed hardware for parallel CRC checksums. Starting from the serial implementation widely reported in the literature, we have identified a recursive formula from which our parallel implementation is derived. In comparison with previous works, the new scheme is faster and more compact and is independent of the technology used in its realization. In our solution, the number of bits processed in parallel can be different from the degree of the generator polynomial. Lastly, we have also developed high-level parametric codes that are capable of generating the circuits autonomously when only the polynomial is given.
Fig. 1. CRC description. $S$ is the sequence for error detecting, $P$ is the divisor and $Q$ is the quotient. $S_1$ is the original sequence of bits to transmit. Finally, $S_2$ is the FCS of $m$ bits.
II. CYCLIC REDUNDANCY CHECK
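As already stated in the introduction, CRC is one of the most powerful error-detecting codes. Briefly speaking, CRC can be described as follows. Let us suppose that a transmitter, T, sends a sequence, $S_1$, of $k$ bits to a receiver, R. Together with $S_1$, T generates a second sequence, $S_2$, of $m$ bits, known as the Frame Check Sequence (FCS), chosen so that the complete sequence $S$, obtained by concatenating $S_1$ and $S_2$, is exactly divisible, under a particular arithmetic, by a predetermined sequence $P$ of $m+1$ bits. T then sends $S$ to R. On receiving it, R divides $S$ (i.e. the message and the FCS) by $P$, using the same arithmetic. If there is no remainder, R assumes there was no error. Fig. 1 illustrates how this mechanism works.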
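The division check described above can be illustrated with a minimal Python sketch. The function name `mod2_remainder` and the 8-bit message are hypothetical, chosen only for illustration; the divisor is the 5-bit example used later in the paper. XOR plays the role of subtraction, as in modulo-2 arithmetic.

```python
def mod2_remainder(bits, poly):
    """Remainder of `bits` divided by `poly` in modulo-2 arithmetic.

    bits: dividend, most significant bit first.
    poly: divisor of m+1 bits, most significant bit first (leading bit 1).
    """
    m = len(poly) - 1
    work = list(bits)
    for i in range(len(work) - m):
        if work[i]:                       # leading bit set: subtract (XOR) the divisor
            for j, p in enumerate(poly):
                work[i + j] ^= p
    return work[-m:]                      # the last m bits hold the remainder


P = [1, 0, 0, 1, 1]                       # divisor (m = 4)
S1 = [1, 1, 0, 1, 0, 1, 1, 0]             # hypothetical message
S2 = mod2_remainder(S1 + [0] * 4, P)      # FCS: remainder of S1 followed by m zeros
S = S1 + S2                               # sequence actually transmitted

# Receiver side: an error-free sequence leaves no remainder.
no_error = mod2_remainder(S, P) == [0, 0, 0, 0]
```

Flipping any single bit of `S` before the receiver-side division yields a non-zero remainder for this divisor, which is how the receiver detects the error.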
Modulo-2 arithmetic is used in the digital realization of the above concepts [3]: the product operator is accomplished by a bitwise AND, whereas both the sum and subtraction are accomplished by bitwise XOR operators. In this case, a CRC circuit (modulo-2 divider) can be easily realized as a special shift register, called a linear feedback shift register (LFSR). Fig. 2 shows a typical architecture. It can be used by both the transmitter and the receiver. In the case of the transmitter, the dividend is the sequence $S_1$ concatenated with a sequence of $m$ zeros to the right. The divisor is $P$. In the simpler case of a receiver, the dividend is the received sequence and the divisor is the same $P$.
The AND gates in Fig. 2 are unnecessary if the divisor $P$ is time-invariant.
The sequence $S_1$ is sent serially to the input of the circuit starting from the most significant bit, $b_0$. Let us suppose that the number of bits of the sequence $S_1$ is an integral multiple of $m$, the degree of the divisor $P$.
The process begins by clearing all FFs. Then, all $k$ bits are sent, one per clock cycle. Finally, $m$ zero bits are sent through the input. In the end, the FCS appears at the outputs of the FFs.
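The serial procedure above can be simulated with a short Python sketch. It assumes the usual update rule for this kind of register, namely that each FF takes the output of the previous FF XORed with the feedback bit wherever the divisor has a one; this rule and the name `lfsr_serial` are assumptions made for illustration, not a transcription of Fig. 2.

```python
def lfsr_serial(message, poly):
    """Serial LFSR: clear the FFs, shift in the message, then m zeros.

    message: bits of the message, most significant bit first.
    poly:    divisor bits [p_m, ..., p_0].
    Returns the FF outputs [x_{m-1}, ..., x_0], i.e. the FCS.
    """
    m = len(poly) - 1
    p = poly[1:]                          # [p_{m-1}, ..., p_0]
    x = [0] * m                           # all FFs cleared
    for d in list(message) + [0] * m:     # one bit per clock cycle
        fb = x[0]                         # feedback bit x_{m-1}
        # x'_i = x_{i-1} XOR (p_i AND x_{m-1}),  x'_0 = d XOR (p_0 AND x_{m-1})
        x = [x[j + 1] ^ (p[j] & fb) for j in range(m - 1)] + [d ^ (p[m - 1] & fb)]
    return x


fcs = lfsr_serial([1, 1, 0, 1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

For this hypothetical 8-bit message the loop runs for 8 + 4 = 12 clock cycles.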
Another possible implementation of the CRC circuit [7] is shown in Fig. 3. In this paper we will call it LFSR2. In this circuit, the outputs of the FFs (after $k$ clock periods) are the same FCS computed by LFSR.
It should be mentioned that, when LFSR2 is used, no sequence of $m$ zeros has to be sent through the input. So, LFSR2 computes the FCS faster than LFSR. In practice, the message length is usually much greater than $m$, so LFSR2 and LFSR have similar performance.
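A minimal sketch of the LFSR2 behaviour follows, under the assumption that the input bit is XORed with the last FF output to form the feedback (the common form of this circuit; the name `lfsr2_serial` is hypothetical). No trailing zeros are shifted in.

```python
def lfsr2_serial(message, poly):
    """Serial LFSR2: the input bit enters through the feedback path.

    message: bits of the message, most significant bit first.
    poly:    divisor bits [p_m, ..., p_0].
    Returns the FF outputs [x_{m-1}, ..., x_0] after len(message) clocks.
    """
    m = len(poly) - 1
    p = poly[1:]                          # [p_{m-1}, ..., p_0]
    x = [0] * m                           # all FFs cleared
    for d in message:                     # no m zeros appended
        fb = x[0] ^ d                     # feedback = x_{m-1} XOR input bit
        x = [x[j + 1] ^ (p[j] & fb) for j in range(m - 1)] + [p[m - 1] & fb]
    return x


fcs2 = lfsr2_serial([1, 1, 0, 1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

For the same hypothetical message and divisor, the register needs 8 clock cycles instead of 12.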
III. RELATED WORKS
Parallel CRC hardware is attractive because, by processing the message in blocks of $w$ bits each, it is possible to reach a speed-up of $w$ with respect to the time needed by the serial implementation. Here we report the main works in the literature. Later, in Section V, we compare our results with those presented in the literature.
As stated by Albertengo and Sisto [7] in 1990, previous works [8]-[10] "dealt empirically with the problem of parallel generation of CRCs". Furthermore, "the validity of the results is in any case restricted to a particular generator polynomial". Albertengo and Sisto [7] proposed an interesting analytical approach. Their idea was to apply digital filter theory to the classical CRC circuit. They derived a method for determining the logic equations for any generator polynomial. Their formalization is based on a $z$-transform. To obtain the logic equations, many polynomial divisions are needed. Thus, it is not possible to write synthesizable VHDL code that automatically generates the equations for parallel CRCs. The theory they developed is restricted to cases where the number of bits processed in parallel is equal to the polynomial degree ($w = m$).
In 1996, Braun et al. [11] presented an approach suitable for FPGA implementation. A very complex analytical proof is presented. They developed a special heuristic logic minimization to compute CRC checksums on FPGAs in parallel. Their main results are a precise formalism and a set of proofs to derive the parallel CRC computation starting from the bit-serial case. Their work is similar to ours, but their proofs are more complex. In 2001, Sprachmann [13] implemented parallel CRC circuits of LFSR2. He proposed interesting VHDL parametric codes. The derivation is valid for any polynomial and data width $w$, but the equations are not as optimized.
In the same year, Shieh et al. [14] proposed another approach based on the theory of the Galois field. The theory they developed is quite general, like that presented in our paper (i.e. $w$ may differ from $m$). However, their hardware implementation is strongly based on lookahead techniques [15], [16]; thus their final circuits require more area and elaboration time. The possibility of using several smaller look-up tables (LUTs) is also shown, but the critical path of the final circuits grows substantially. Their derivation method is similar to ours but, as in [13], the equations are not optimized (see Section V).
IV. PARALLEL CRC COMPUTATION
Starting from the circuit represented in Fig. 2, we have developed our parallel implementation of the CRC. In the following, we assume that the degree of the generator polynomial ($m$) and the length of the message to be processed ($k$) are both multiples of the number of bits to be processed in parallel ($w$). This is typical in data transmission, where a message consists of many bytes and the generator polynomial, like the desired parallelism, consists of a few nibbles.
In the final circuit that we will obtain, the sequence $S_1$ plus the zeros are sent to the circuit in blocks of $w$ bits each. After $(k+m)/w$ clock periods, the FF outputs give the desired FCS.
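From linear systems theory [17] we know that a discrete-time, time-invariant linear system can be expressed as follows:

$$
\begin{cases}
X(i+1) = F\,X(i) + G\,U(i) \\
Y(i) = H\,X(i) + J\,U(i)
\end{cases}
\qquad (1)
$$

where $X$ is the state of the system, $U$ the input and $Y$ the output. We use $F$, $G$, $H$, $J$ to denote matrices, and use $X$, $Y$, and $U$ to denote column vectors.
The solution of the first equation of the system (1) is:

$$
X(i) = F^{i}\,X(0) + \left[\, F^{i-1}G \;\cdots\; FG \;\; G \,\right] \left[\, U(0) \;\cdots\; U(i-1) \,\right]^{T} \qquad (2)
$$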
We can apply eq. (2) to the LFSR circuit (Fig. 2). In fact, if we use $\oplus$ to denote the XOR operation, the symbol $\cdot$ to denote bitwise AND, and $B$ to denote the set $\{0,1\}$, it is easy to demonstrate that the structure $\{B, \oplus, \cdot\}$ is a ring with identity (Galois Field GF(2) [18]). From this consideration, the solution of the system (1), expressed by (2), is valid even if we replace multiplication and addition with the AND and XOR operators respectively. In order to point out that the XOR and AND operators must also be used in the product of matrices, we will denote their product by $\otimes$.
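The matrix product over GF(2) described above can be sketched in Python as follows: every multiplication becomes an AND and every addition becomes an XOR. The helper names `gf2_matvec` and `gf2_matmul` are hypothetical; matrices are plain lists of rows of 0/1 integers.

```python
def gf2_matvec(A, v):
    """Matrix-vector product over GF(2): AND for products, XOR for sums."""
    out = []
    for row in A:
        acc = 0
        for a, b in zip(row, v):
            acc ^= a & b
        out.append(acc)
    return out


def gf2_matmul(A, B):
    """Matrix-matrix product over GF(2), computed column by column."""
    columns = [gf2_matvec(A, list(col)) for col in zip(*B)]
    return [list(row) for row in zip(*columns)]   # transpose back to rows


# 1 XOR 1 = 0, so this matrix is its own inverse over GF(2).
M = [[1, 1],
     [0, 1]]
M_squared = gf2_matmul(M, M)
```

The small example shows where the arithmetic differs from the integer case: over the integers the top-right entry of the square would be 2, whereas here it is 0.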
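Let us consider the circuit shown in Fig. 2. It is just a discrete-time, time-invariant linear system for which: the input $U(i)$ is the $i$-th bit of the input sequence; the state $X = [\,x_{m-1} \cdots x_1 \;\; x_0\,]^{T}$ represents the FFs output; and the vector $Y$ coincides with $X$, i.e. $H$ and $J$ are the identity and zero matrices respectively. Matrices $F$ and $G$ are chosen according to the equations of the serial LFSR. So, we have:

$$
F = \begin{bmatrix}
p_{m-1} & 1 & 0 & \cdots & 0 \\
p_{m-2} & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{1} & 0 & 0 & \cdots & 1 \\
p_{0} & 0 & 0 & \cdots & 0
\end{bmatrix},
\qquad
G = [\,0 \;\; 0 \;\cdots\; 0 \;\; 1\,]^{T}
\qquad (3)
$$

where $p_i$ are the bits of the divisor $P$ (i.e. the coefficients of the generator polynomial).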
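As an illustration, the two matrices of the serial register can be built from the divisor bits with a few lines of Python. This is a sketch under the assumption that the state vector is ordered from the last FF to the first and that the divisor is given as `[p_m, ..., p_0]`; the name `build_f_g` is hypothetical.

```python
def build_f_g(poly):
    """Build the state matrix F and the input vector G of the serial LFSR.

    poly: divisor bits [p_m, ..., p_0].
    F has [p_{m-1}, ..., p_0] as first column and a shifted identity
    in the remaining m-1 columns; G selects the last state bit.
    """
    m = len(poly) - 1
    p = poly[1:]                                   # [p_{m-1}, ..., p_0]
    F = [[p[r]] + [1 if c == r + 1 else 0 for c in range(1, m)]
         for r in range(m)]
    G = [0] * (m - 1) + [1]
    return F, G


F, G = build_f_g([1, 0, 0, 1, 1])
# F == [[0, 1, 0, 0],
#       [0, 0, 1, 0],
#       [1, 0, 0, 1],
#       [1, 0, 0, 0]]
```

Only the first column depends on the divisor, which is why the powers of this matrix have the regular structure exploited in the rest of the section.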
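When $i$ coincides with $w$, the solution derived from eq. (2), with substitution of the operators, is:

$$
X(w) = F^{w} \otimes X(0) \oplus [\,0 \cdots 0 \mid U(0) \cdots U(w-1)\,]^{T}
$$

where $X(0)$ is the initial state of the FFs and the vector on the right has $m-w$ leading zeros. Considering that the system is time-invariant, we obtain a recursive formula:

$$
X' = F^{w} \otimes X \oplus D \qquad (4)
$$

where, for clarity, we have indicated with $X'$ and $X$, respectively, the next state and the present state of the system, and $D$ assumes the values $[\,0 \cdots 0 \mid b_0 \cdots b_{w-1}\,]^{T}$, $[\,0 \cdots 0 \mid b_w \cdots b_{2w-1}\,]^{T}$, etc., where $b_i$ are the bits of the sequence $S_1$ followed by a sequence of $m$ zeros.
This result implies that it is possible to calculate the $m$ bits of the FCS by sending the $k+m$ bits of the message $S_1$ plus the zeros, in blocks of $w$ bits each. So, after $(k+m)/w$ clock periods, $X$ is the desired FCS.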
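The block-wise update can be checked numerically with the following Python sketch. It assumes that $w \le m$, that the number of bits (message plus $m$ zeros) is a multiple of $w$, and that each block enters the last $w$ positions of the state vector; the function names and the 8-bit message are hypothetical. The power of the state matrix is obtained here by plain repeated multiplication, not by the optimized construction discussed next.

```python
def gf2_matvec(A, v):
    """Matrix-vector product with AND as product and XOR as sum."""
    return [sum(a & b for a, b in zip(row, v)) & 1 for row in A]


def gf2_matmul(A, B):
    """Matrix-matrix product over GF(2)."""
    cols = [gf2_matvec(A, list(c)) for c in zip(*B)]
    return [list(r) for r in zip(*cols)]


def crc_parallel(message, poly, w):
    """FCS of `message` computed w bits per step (serial LFSR equivalent)."""
    m = len(poly) - 1
    p = poly[1:]                                   # [p_{m-1}, ..., p_0]
    F = [[p[r]] + [int(c == r + 1) for c in range(1, m)] for r in range(m)]
    Fw = F
    for _ in range(w - 1):                         # Fw = F^w
        Fw = gf2_matmul(Fw, F)
    bits = list(message) + [0] * m                 # message followed by m zeros
    x = [0] * m                                    # FFs cleared
    for start in range(0, len(bits), w):
        D = [0] * (m - w) + bits[start:start + w]  # block in the last w positions
        x = [a ^ b for a, b in zip(gf2_matvec(Fw, x), D)]
    return x


msg = [1, 1, 0, 1, 0, 1, 1, 0]
P = [1, 0, 0, 1, 1]
results = {w: crc_parallel(msg, P, w) for w in (1, 2, 4)}
```

With $w = 1$ the loop degenerates to the serial register, so the three entries of `results` are expected to coincide; with $w = 4$ the same FCS is reached in 3 steps instead of 12.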
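Now, it is important to evaluate the matrix $F^w$. There are several options, but it is easy to show that the matrix can be constructed recursively, when $i$ ranges from 2 to $w$:

$$
F^{i} = \left[\; F^{i-1} \otimes [\,p_{m-1} \cdots p_1 \;\; p_0\,]^{T} \;\middle|\; \text{the first } m-1 \text{ columns of } F^{i-1} \;\right] \qquad (5)
$$

This formula permits an efficient VHDL code to be written, as we will show later.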
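The recursive construction can be sketched in Python as follows. Each new power needs only one matrix-vector product (for the first column); the remaining columns are copied from the previous power. The name `f_powers` and the list-of-rows representation are assumptions made for illustration.

```python
def f_powers(poly, w):
    """Return [F, F^2, ..., F^w] built column-wise, without full matrix products.

    poly: divisor bits [p_m, ..., p_0].
    """
    m = len(poly) - 1
    p_vec = poly[1:]                               # [p_{m-1}, ..., p_0]
    F = [[p_vec[r]] + [int(c == r + 1) for c in range(1, m)] for r in range(m)]
    powers = [F]
    for _ in range(2, w + 1):
        prev = powers[-1]
        # New first column: previous power times the divisor vector over GF(2).
        first_col = []
        for row in prev:
            acc = 0
            for a, b in zip(row, p_vec):
                acc ^= a & b
            first_col.append(acc)
        # Remaining columns: the first m-1 columns of the previous power.
        powers.append([[first_col[r]] + prev[r][:m - 1] for r in range(m)])
    return powers


F1, F2, F3, F4 = f_powers([1, 0, 0, 1, 1], 4)
```

Because every step is a column shift plus one new column, the same loop maps naturally onto a generate-style description in a hardware description language.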
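From eq. (5) we can obtain $F^w$ when $F^m$ is already available. If we indicate with $P'$ the vector $[\,p_{m-1} \cdots p_1 \;\; p_0\,]^{T}$, we have:

$$
F^{w} = \left[\; F^{w-1} \otimes P' \;\cdots\; F \otimes P' \;\; P' \;\middle|\; \begin{matrix} I_{m-w} \\ 0 \end{matrix} \;\right]
$$

where $I_{m-w}$ is the identity matrix of order $m-w$. Furthermore, we have:

$$
F^{m} = \left[\; F^{m-1} \otimes P' \;\cdots\; F \otimes P' \;\; P' \;\right]
$$

So, $F^w$ may be obtained from $F^m$ as follows: the first $w$ columns of $F^w$ are the last $w$ columns of $F^m$.
The upper right part of $F^w$ is $I_{m-w}$ and the lower right part must be filled with zeros.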
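The column-selection rule above translates directly into a small Python helper. This is a sketch; the name `fw_from_fm` is hypothetical, and the matrix used in the example is the fourth power of the state matrix for the divisor 1,0,0,1,1, computed separately.

```python
def fw_from_fm(Fm, w):
    """Derive the w-th power of the state matrix from its m-th power.

    Left part:  the last w columns of Fm.
    Right part: identity of order m-w on top, zeros below.
    """
    m = len(Fm)
    rows = []
    for r in range(m):
        left = Fm[r][m - w:]                           # last w columns of Fm
        right = [int(c == r) for c in range(m - w)]    # identity on top, zeros below
        rows.append(left + right)
    return rows


F4 = [[1, 1, 0, 0],
      [0, 1, 1, 0],
      [1, 0, 1, 1],
      [1, 0, 0, 1]]
F2 = fw_from_fm(F4, 2)
```

No arithmetic is involved: a circuit generator that stores only the $m$-th power can produce any narrower data width by copying columns.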
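Let us suppose, for example, we have $P = \{1,0,0,1,1\}$. It follows that:

$$
F = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}
$$

After applying eq. (5) we obtain:

$$
F^{2} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix},
\quad
F^{3} = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},
\quad
F^{4} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}
$$

As indicated above, having $F^4$ available, a power of $F$ of lower order is immediately obtained. So, for example, $F^2$ takes its first two columns from the last two columns of $F^4$, with $I_2$ in the upper right part and zeros in the lower right part:

$$
F^{2} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
$$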
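The same procedure may be applied to derive equations for the parallel version of LFSR2. In this case the matrix $G$ is $P'$ and equation (4) becomes:

$$
X' = F^{w} \otimes (X \oplus D) \qquad (8)
$$

where $D$ assumes the values $[\,b_0 \cdots b_{w-1} \mid 0 \cdots 0\,]^{T}$, $[\,b_w \cdots b_{2w-1} \mid 0 \cdots 0\,]^{T}$, etc.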
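For the parallel LFSR2, a minimal Python sketch follows. It assumes the update takes the form "next state = $w$-th power of the state matrix applied to (state XOR data block)", with the $w$ data bits placed in the first $w$ positions of the block vector and zeros elsewhere; this form, the function names and the 8-bit message are assumptions made for illustration.

```python
def gf2_matvec(A, v):
    """Matrix-vector product with AND as product and XOR as sum."""
    return [sum(a & b for a, b in zip(row, v)) & 1 for row in A]


def gf2_matmul(A, B):
    """Matrix-matrix product over GF(2)."""
    cols = [gf2_matvec(A, list(c)) for c in zip(*B)]
    return [list(r) for r in zip(*cols)]


def crc_parallel_lfsr2(message, poly, w):
    """FCS computed w bits per step with the LFSR2 form (no trailing zeros)."""
    m = len(poly) - 1
    p = poly[1:]                                   # [p_{m-1}, ..., p_0]
    F = [[p[r]] + [int(c == r + 1) for c in range(1, m)] for r in range(m)]
    Fw = F
    for _ in range(w - 1):                         # Fw = F^w
        Fw = gf2_matmul(Fw, F)
    x = [0] * m                                    # FFs cleared
    for start in range(0, len(message), w):
        D = list(message[start:start + w]) + [0] * (m - w)   # block in the first w positions
        x = gf2_matvec(Fw, [a ^ b for a, b in zip(x, D)])    # XOR first, then multiply
    return x


msg = [1, 1, 0, 1, 0, 1, 1, 0]
P = [1, 0, 0, 1, 1]
results = {w: crc_parallel_lfsr2(msg, P, w) for w in (1, 2, 4)}
```

The data block is XORed with the state before the matrix product, which is where the additional XOR level of this variant comes from.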
A. Hardware realization
A parallel implementation of the CRC can be derived from the above considerations. Yet again, it consists of a special register. In this case the inputs of the FFs are the exclusive sum of some FF outputs and inputs. In the appendix, the interested reader can find the two listings that generate the correct VHDL code for the CRC parallel circuit we propose here.
For synthesizing the parallel CRC circuits on a Pentium II 350 MHz with 64 MB of RAM, less than a couple of minutes are necessary in the cases of CRC-12, CRC-16, and CRC-CCITT. For CRC-32, several hours are required by our synthesis tool to evaluate $F^m$. We have therefore also written MATLAB code that generates VHDL code. This code produces the logic equations of the desired CRC directly; thus it is synthesized much faster than the previous VHDL code.
B. Examples
Here we report our results applied to four commonly used CRC generator polynomials. As we stated in the previous paragraph, $F^w$ may be derived from $F^m$, so we report only the $F^m$ matrix. In order to improve readability, the matrices $F^m$ are not reported as matrices of bits. They are reported as a column vector in which each element is the hexadecimal representation of the binary sequence obtained from the corresponding row of $F^m$, where the first bit is the most significant. For the example reported above we have $F^4 = [\,\mathrm{C} \;\; 6 \;\; \mathrm{B} \;\; 9\,]^{T}$.
V. COMPARISONS
Albertengo and Sisto [7] based their formalization on the $z$-transform. In their approach, many polynomial divisions are required to obtain the logic equations. This implies that it is not possible to write synthesizable VHDL codes that automatically generate CRC circuits. Their work is based on the LFSR2 circuit. As we have already shown in Sect. IV, our theory can be applied to the same circuit.
However, eq. (8) shows that, generally speaking, one more level of XOR is required with respect to the parallel LFSR circuit we propose. This implies that our proposal is, generally, faster. Further considerations can be made if an FPGA is chosen as the target technology. Large FPGAs are generally based on look-up tables (LUTs). A LUT is a small SRAM which usually has more than two inputs (typically four or more). In the case of high-speed LFSR2 there are many two-input XOR gates (see [7], page 68). This implies that, if the CRC circuit is realized using FPGAs, many LUTs are not completely utilized. This phenomenon is less critical in the case of LFSR. As a consequence, parallel LFSR realizations are cheaper than LFSR2 ones. In order to give some numerical results to confirm our considerations, we have synthesized the CRC32 in both cases. With LFSR we needed 162 LUTs to obtain a critical path of 7.3 ns, whereas, for LFSR2, 182 LUTs and 10.8 ns are required.
The main difference between our work and that of Braun et al. [11] is the dimension of the matrices to be dealt with and the complexity of the proofs. Our results are simpler, i.e., we work with smaller matrices and our proofs are not as complex as those presented in [11].
We have compiled the VHDL codes reported in [13] using our Altera tool. The implementation of our parallel circuit usually requires less area (70-90%) and has higher speed (a speedup of 4 is achieved). For details see Table I. There are two main reasons that explain these results. The first is the one mentioned at the beginning of this section regarding the differences between LFSR and LFSR2 FPGA implementations. The second is that our starting equations are optimized. More precisely, in our equation of $x'_i$, the term $x_j$ appears only once or not at all, while in the starting equation of [13] $x_j$ may appear more often (up to $m$ times), as can be observed in Fig. 4 of [13]. Optimizations like $x_j \oplus x_j = 0$ and $x_j \oplus x_j \oplus x_j = x_j$ must be carried out by the synthesis tool. When $m$ grows, many expressions of this kind are present in the final equations. Even if very powerful VHDL synthesis tools are used, it is not certain that they are able to find the most compact logical form. Even when they are able to, more synthesis time is necessary with respect to our final VHDL code.
In [14] a detailed performance evaluation of the CRC-32 circuit is reported. For comparison purposes we take the results from [14] when $m = w = 32$. For the matrix realization they start from a requirement of 448 2-input XORs and a critical path of 15 levels of XORs. After Synopsys optimization they obtain 408 2-input XORs and 7 levels of gates. We evaluated the number of required 2-input XORs starting from the matrix $F^w$, counting the ones and supposing that XORs with more than two inputs are realized with binary tree architectures. So, for example, to realize an 8-input XOR, 3 levels of 2-input XORs are required with a total of 7 gates. This approach gives 452 2-input XORs and only 5 levels of gates before optimization. This implies that our approach produces faster circuits, but the circuits are slightly larger. However, with some simple manual tricks it is possible to obtain hardware savings. For example, by identifying common XOR sub-expressions and realizing them only once, the number of required gates decreases to 370. With other smart tricks it is possible to obtain more compact circuits. We do not have the Synopsys tool at our disposal, so we do not know what automatic optimization is achievable starting from the 452 initial XORs.
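The counting rule used above (an $n$-input XOR built as a binary tree of 2-input gates) can be sketched in Python. The function names are hypothetical, and the `extra_inputs` parameter, which adds a fixed number of data inputs to every row, is an assumption made for illustration; it is not stated in the text how the data inputs were counted.

```python
def xor_tree_cost(n_inputs):
    """Gates and levels of a binary tree of 2-input XORs with n_inputs inputs."""
    if n_inputs < 2:
        return 0, 0                        # a single signal needs no gate
    gates = n_inputs - 1
    levels = (n_inputs - 1).bit_length()   # ceil(log2(n_inputs))
    return gates, levels


def matrix_xor_cost(M, extra_inputs=0):
    """Total 2-input XORs and worst-case depth for the equations given by M.

    Each row of M is one output equation; every 1 in the row is one XOR input.
    """
    total_gates, max_levels = 0, 0
    for row in M:
        gates, levels = xor_tree_cost(sum(row) + extra_inputs)
        total_gates += gates
        max_levels = max(max_levels, levels)
    return total_gates, max_levels


eight_input = xor_tree_cost(8)             # the 8-input case quoted in the text
F4 = [[1, 1, 0, 0],
      [0, 1, 1, 0],
      [1, 0, 1, 1],
      [1, 0, 0, 1]]
small_example = matrix_xor_cost(F4, extra_inputs=1)
```

This count is an upper bound before optimization, since shared sub-expressions between rows are counted once per row.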
VI. ACKNOWLEDGMENTS
The authors wish to thank Chao Yang of ILOG, Inc. for his useful suggestions during the revision process. The authors also wish to thank the anonymous reviewers for their work, a valuable aid that contributed to the improvement of the quality of the paper.

    constant CRCDIM: integer := 16;
    constant CRC: std_logic_vector(CRCDIM downto 0) := CRC16;
    constant DATA_WIDTH: integer range 1 to CRCDIM := 16;
    type matrix is array (CRCDIM - 1 downto 0)
B. Matlab code
In Fig. 7 we report the Matlab code, named crcgen.m, used to directly produce the VHDL listing of the desired CRC where only the logical equations of the CRC are present. This code is synthesized much faster than the previous one. In order to work correctly, the crcgen.m file needs another file, named crcgen.txt; this file contains the first 34 rows of crcgen.vhd.
