Introduction
©Copyright 1997 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
Messages transported through an Asynchronous Transfer Mode (ATM) network must be segmented into short, fixed-length packets, called cells, at the source and reassembled at the destination. Several "adaptation layers", which specify how the process is carried out, are defined by the ATM standards [1]. ATM Adaptation Layer 5 (AAL5) is the simplest way of handling the segmentation and reassembly process itself. It also makes better use of the bandwidth, because, unlike the other adaptation layers, it does not require overhead in any of the cells, except for some bytes in the information field of the last one. For these reasons, AAL5 tends to be the preferred method for breaking messages into cells. Because data integrity must be ensured end to end, a standard cyclic redundancy checking (CRC) technique is utilized on the entire message, and a frame check sequence (FCS) is added to the data and transported with the last cell. At the remote end, while reassembly is performed, the message is checked for data integrity. The polynomial chosen by the relevant standards bodies to implement the AAL5 CRC is the same degree-32 polynomial used for the fiber distributed data interface (FDDI) [2]:
G(x) = x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^{8} + x^{7} + x^{5} + x^{4} + x^{2} + x + 1
The large number of terms (15) of this polynomial makes it more difficult to implement than the other CRCs in use, especially when higher-speed lines are considered. Hence, the intent of AAL5 to provide a simple and efficient segmentation and reassembly (SAR) process is somewhat offset by the need to compute and check a complex degree-32 CRC. This paper proposes a method of simplifying the computation in order to facilitate implementation of circuitry for high-speed CRC computation in standard CMOS technology.
State of the art
CRC standards (see, for example, [2]) each describe their particular CRC implementation in the form of a linear feedback shift register (LFSR), in which only one bit at a time can be handled. In contrast, all of the methods for expediting CRC calculation known to the author propose byte-wise processing. (One of the very first papers on this is [3]. Although it was published much later, [4] is more frequently cited in the relevant literature.) However, we assert that the method of [5], which proposes eliminating the shift-register model and handling the computation directly according to the mathematical basis of CRCs (the algebra of polynomials), permits achieving much better results as far as hardware implementation is concerned. In [5], the per-byte (8-bit byte) computation described requires the equivalent of 114 two-input XORs and an eight-input XOR operator to compute the next bit values.
This may still be too much, however, for the very high-speed computation required to process AAL5 ATM connections flowing through OC-12 lines, for instance. One byte is received approximately every 13 ns for an OC-12 line (622 × 10^6 bits/s) and every 3 ns for an OC-48 line (2.4 × 10^9 bits/s). Although not all of the traffic of such lines is likely to be AAL5 connections that must be segmented or reassembled, such numbers indicate that the instantaneous processing capability may have to be very high so as not to degrade the overall performance at a network node.
Simplifying the calculation
All of the methods for computing CRCs known to the author, including the one of [5], have in common the process of dividing the message by the CRC polynomial G(x) chosen by the relevant standardization group to perform the calculation. The new concept developed in this paper consists of carrying out this calculation in two steps:
1. Checking and generating the CRC is done, except at the final step, with another polynomial, M(x) = G(x) × P(x). This polynomial must be a multiple of the CRC polynomial G(x), so that dividing the remainder of the first division (of the message by M(x)) by G(x) in turn yields the same remainder as dividing the message directly by G(x). P(x) is a polynomial of the lowest possible degree, to keep the degree of M(x) low, but it must be chosen so that the resulting polynomial M(x) has fewer terms than G(x), in order to simplify the first division. The desirable structure for M(x), to make calculation easier, is further discussed in the next section.
2. The result of the first division, performed on all of the ATM cells constituting the message, is a fixed-length vector whose length in bits equals the degree of M(x). This vector must then be divided only once by G(x) to obtain the final result.
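The identity behind the two steps can be checked numerically. The following is a minimal sketch under stated assumptions, not the author's implementation: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), `poly_mul`, `poly_mod` and `two_step_remainder` are illustrative names, and the multiplier P(x) = x^3 + x + 1 used in the usage note is an arbitrary choice that does not yield a sparse M(x). Only plain polynomial remainders are computed; any preset or complement conventions imposed by the standard are left out.

```python
# CRC-32 generator G(x): x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11
# + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1 (bit i = coefficient of x^i).
G = 0x104C11DB7


def poly_mul(a, b):
    """Carry-less (GF(2)) product of two polynomials held as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(value, modulus):
    """Remainder of the GF(2) division of value by modulus."""
    degree = modulus.bit_length() - 1
    while value.bit_length() - 1 >= degree:
        # Align the modulus with the leading term and cancel it.
        value ^= modulus << (value.bit_length() - 1 - degree)
    return value


def two_step_remainder(message, multiplier):
    """Step 1: reduce modulo M(x) = G(x) * P(x). Step 2: reduce once modulo G(x)."""
    m_poly = poly_mul(G, multiplier)
    intermediate = poly_mod(message, m_poly)  # fixed-length vector
    return poly_mod(intermediate, G)          # final division
```

For example, with `message = int.from_bytes(b"any payload", "big")`, the value of `two_step_remainder(message, 0b1011)` equals `poly_mod(message, G)`, because M(x) is a multiple of G(x).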
Making M(x) "simple"
The polynomial M(x) = G(x) × P(x) is said to be a simple polynomial if it has fewer terms than G(x). Not every simple polynomial is satisfactory, however, because it is desirable to have the terms well spread out between the maximum (the degree of the polynomial) and minimum (x^0, or 1) terms. If the calculation is carried out on a per-byte basis, the powers of the terms should ideally be at least 8 bits apart, so as not to "overlap" in the calculation. For instance, the following multiple of the CRC-32 polynomial G(x),
which has only seven terms, is not as good as the following (referred to hereafter as M123),
which has, however, one more term. This is because, in this second instance, the powers of the terms are all at least 8 bits apart. With this polynomial, the first division can be carried out 8 bits at a time with only two-input XORs by the state machine shown in Figure 1. The method is the one described in [5], in which calculations are done in the algebra of polynomials modulo G(x). Computing is done here modulo M123, so that the result of any operation is a vector that is no more than 123 bits long. Implementing the state machine of Figure 1 requires only 56 two-input XORs and 123 latches. The other 67 bits of the next intermediate result are simply bits of the current result shifted by eight positions. This permits the state machine to operate at a very high rate, so that the calculation can keep up with the very high speed of the optical communication lines commonly used nowadays. This speed is achieved at the expense of more bits to process in parallel and the need to store a wider vector (123 bits instead of 32) holding the intermediate result of the computation in progress. This is not really a drawback at present, when gate arrays with more than 100 000 gates are commonly available. Hence, the proposed scheme allows one to trade the size of the vector that must be manipulated (and stored in latches) for processing speed.
Other compromises are possible if wider XORs (three-input and four-input XORs) can be used while the required processing speed is achieved. Two other polynomials to implement the above computation scheme are listed in Table 1, along with M123 and G(x).
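The per-byte update enabled by well-spaced terms can be sketched in software. This is an illustration only: the exponent list `EXPONENTS` is a hypothetical polynomial chosen solely because its terms are at least 8 apart. It is not a multiple of G(x) and is not one of the polynomials of Table 1. The function names are likewise assumptions. Because the 8-bit feedback windows never overlap, every bit of the next state is either a shifted bit or the XOR of exactly two bits, so a polynomial with seven terms below the leading one needs 7 × 8 = 56 two-input XORs, which agrees with the count quoted for M123.

```python
# Hypothetical sparse polynomial x^40 + x^29 + x^17 + x^8 + 1.
# Assumption for illustration: terms spaced >= 8 apart; NOT a multiple of G(x).
EXPONENTS = [40, 29, 17, 8, 0]


def bytewise_remainder(data, exponents=EXPONENTS):
    """Divide the message by the sparse polynomial, one byte per step."""
    degree, lower = exponents[0], exponents[1:]
    mask = (1 << degree) - 1
    state = 0
    for byte in data:
        top = state >> (degree - 8)            # the 8 bits shifted out this step
        state = ((state << 8) & mask) | byte   # plain shift; new byte enters
        for e in lower:
            # Feedback window [e, e+7]; windows do not overlap, so each
            # affected bit is a two-input XOR.
            state ^= top << e
    return state


def bit_serial_remainder(data, exponents=EXPONENTS):
    """Reference: the same division done one bit at a time (LFSR style)."""
    degree = exponents[0]
    poly = sum(1 << e for e in exponents)
    state = 0
    for byte in data:
        for i in range(7, -1, -1):
            state = (state << 1) | ((byte >> i) & 1)
            if state >> degree:
                state ^= poly
    return state


def xor_count(exponents=EXPONENTS):
    """Two-input XORs needed per byte step: 8 per term below the leading one."""
    return 8 * (len(exponents) - 1)
```

Both functions return the same remainder for any byte string. The byte-wise version relies on the gap between the two highest exponents also being at least 8, so that one step never needs a second reduction.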

Final division
According to the scheme described here, the final division must still be carried out with G(x). Because this second and final division is now applied to a short, fixed-length vector (regardless of the length of the initial message), techniques that are not generally practicable with CRCs, because the message can be of any size, may be considered. Among them, the simplest consists of implementing the method always used with error-correcting codes (ECCs) employed to improve memory reliability, which work on a fixed-size word. A matrix can be devised (the H matrix, in ECC jargon) and implemented in the form of a combinatorial array that performs the final division. The input to the array is the remainder of the first division (for instance, a 123-bit vector if M123 is selected), and the output is the 32-bit vector remainder of the division by G(x).
The method for doing this can be found in [6] and in many other publications that deal with ECCs. For instance, the corresponding H matrix for M53 is given in Table 2. The 53-bit input vector that is the result of the division by M53 (indexed 0 to 52) is applied to the 53-column matrix so as to compute a bit value for each of the 32 rows. The 1s in each matrix row represent the bits of the input vector for which parity must be computed in order to get the 32-bit vector (indexed 0 to 31) that is the result of the division by G(x). The column labeled "XOR inputs" in Table 2 indicates how many bits of the input must be combined to compute the bit of the corresponding row. A 13-input XOR is required. The speed of such an array can by no means match the cycle time of the state machine previously described; the equivalent of several cycles is necessary to generate the result. Because this calculation is done only once at the end, however, the overall computation is much faster than with traditional methods, even if short messages (down to single-cell messages) are considered.
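One way to picture the combinatorial array is as a table of precomputed columns. The sketch below rests on assumptions: column j is taken to be x^j mod G(x), bit i of an integer is the coefficient of x^i, and the names `COLUMNS`, `final_division` and `row_weights` are illustrative. The row and column ordering of Table 2 may follow a different convention, so the sketch is not claimed to reproduce that table.

```python
G = 0x104C11DB7   # CRC-32 generator, bit i = coefficient of x^i
WIDTH = 53        # width of the intermediate remainder when M53 is used


def poly_mod(value, modulus):
    """Remainder of the GF(2) division of value by modulus."""
    degree = modulus.bit_length() - 1
    while value.bit_length() - 1 >= degree:
        value ^= modulus << (value.bit_length() - 1 - degree)
    return value


# Column j of the matrix: the 32-bit remainder of x^j divided by G(x).
COLUMNS = [poly_mod(1 << j, G) for j in range(WIDTH)]


def final_division(vector):
    """Combine (XOR) the columns selected by the 1 bits of the input vector."""
    result = 0
    for j in range(WIDTH):
        if (vector >> j) & 1:
            result ^= COLUMNS[j]
    return result


def row_weights():
    """Number of input bits feeding each of the 32 output bits (XOR fan-in)."""
    return [sum((col >> i) & 1 for col in COLUMNS) for i in range(32)]
```

Since the division is linear over GF(2), `final_division(v)` agrees with `poly_mod(v, G)` for every input narrower than 54 bits, and `row_weights()` gives the fan-in of the parity tree for each output bit.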
Method summary
The whole computation scheme is summarized hereafter for M123. As an example, let us assume that the message is one kilobyte long. The state machine cycle time to process one byte can be as low as 10 ns (a 100-MHz state machine) with a two-input XOR. Thus, the 48 bytes of each ATM cell are processed in 480 ns (one cell arrives every 700 ns at 622 megabits per second). The complete message, which comprises 21 cells, requires roughly 10 µs plus the final division, which can be done in five cycles or less (i.e., 50 ns maximum). Thus, the final division accounts for only 0.5% of the total calculation time in this first example. For a single-cell message (the worst case), which is processed in 480 ns, the final division accounts for approximately 10% of the total computation time. The two-step scheme described in this paper is summarized in Figure 2 for M123.
The three polynomials given in Table 1 are the best that the author was able to find in an exhaustive search up to degree 128 of multiples of G(x). (Only those multiples with consecutive terms having exponents differing by 8 or more were retained. Polynomials M71 and M53 are actually by-products of this search, which was conducted only to find the polynomial of lowest possible degree that permits the first division to be performed per 8-bit byte while requiring only two-input XORs. M123 is the result of this search.) The use of a two-input operator guarantees that the state machine is intrinsically the fastest possible. An improvement of the scheme described in this paper could come only from a polynomial of degree less than 123, with fewer terms, which would require less hardware but provide no speed advantage.
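The timing figures of the example can be reproduced with a few lines of arithmetic. This sketch assumes that one kilobyte means 1000 bytes (which matches the 21 cells quoted), that every cell is charged a full 48 byte cycles, and that the final division takes the maximum of five cycles. The constant and function names are illustrative.

```python
import math

CYCLE_NS = 10              # one byte per cycle at 100 MHz
PAYLOAD_BYTES = 48         # payload bytes carried by each ATM cell
CELL_BYTES = 53            # full ATM cell, header included
FINAL_DIVISION_CYCLES = 5  # upper bound quoted for the final division
LINE_RATE_BPS = 622e6      # OC-12


def timing(message_bytes):
    """Return (cells, first_step_ns, final_ns, share of final division)."""
    cells = math.ceil(message_bytes / PAYLOAD_BYTES)
    first_step_ns = cells * PAYLOAD_BYTES * CYCLE_NS
    final_ns = FINAL_DIVISION_CYCLES * CYCLE_NS
    share = final_ns / (first_step_ns + final_ns)
    return cells, first_step_ns, final_ns, share


def cell_period_ns():
    """Time between cell arrivals on the line, in nanoseconds."""
    return CELL_BYTES * 8 / LINE_RATE_BPS * 1e9
```

Under these assumptions, `timing(1000)` gives 21 cells and 10 080 ns for the first step, with the final division close to 0.5% of the total; `timing(48)` gives a share a little under 10% for a single cell; and `cell_period_ns()` is about 680 ns, consistent with the rounded 700 ns in the text.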
