We propose a pipelined division architecture for low-power ECC applications, based on partial division on a group basis and a lookahead technique that exploits the linearity of finite field arithmetic. The throughput is one division per clock cycle regardless of the degree of the dividend polynomial. The salient feature of this architecture is its very low power-delay product. To verify the performance of the proposed division architecture relative to the conventional one using an LFSR, three RS and BCH code applications were fabricated in 0.8 µm double-metal CMOS technology. Experimental results show about 32, 65, and 67 times improvement in power consumption compared with the conventional LFSR-based design.
II. New Division Algorithm based on LAPR
While the conventional division algorithm performs serial processing on a symbol or bit basis, our division algorithm performs parallel processing on a group basis. It starts from the definition of $P(x)$ as the long, arbitrary dividend polynomial of degree $n$ and $M(x)$ as the fixed divisor polynomial of degree $k$, i.e., $P(x) = \sum_{i=0}^{n} p_i x^i$ and $M(x) = \sum_{i=0}^{k} m_i x^i$. If we define $q$ as the maximum number that satisfies $n \ge q(k+1) + k$, then the elements of the dividend polynomial can be grouped into $q+2$ orthogonal groups as follows:

$$P(x) = \sum_{j=0}^{q} P_j(x)\, x^{j(k+1)} + P_{-1}(x).$$

All of the groups $P_j(x)$ for $q \ge j \ge 0$ have the same format as $S(x)$, defined as

$$S(x) = \Bigl(\sum_{i=0}^{k} s_i x^i\Bigr) x^k.$$
Figure 1 shows our division algorithm schematically for hardware implementation. $Q_q(x)$ and $R_q(x)$ are the quotient and the remainder, respectively, resulting from $P_q(x) / M(x)$. $P'_j(x)$ for $q-1 \ge j \ge 0$ is the sum of $R_{j+1}(x)$ shifted left by one symbol and $P_j(x)$. Note that, in a finite field, adding two symbols or polynomials of the same degree does not produce a carry, so the resulting polynomial $P'_j(x)$ has the same format as $P_j(x)$. All of the $Q_j(x)$ and $R_j(x)$ for $q-1 \ge j \ge 0$ are the quotients and the remainders resulting from $P'_j(x) / M(x)$. We define them as partial-quotients and partial-remainders, respectively, since they are the results of a partial division. The overall quotient of $P(x)/M(x)$ is the weighted sum of all the partial-quotients $Q_j(x)$ for $q \ge j \ge 0$, and the overall remainder is the sum of $R_0(x)$ and the last group $P_{-1}(x)$.
Since every group presented to a partial division, namely $P_q(x)$ and $P'_j(x)$ for $q-1 \ge j \ge 0$, has the same format as $S(x)$, all of the $Q_j(x)$ and $R_j(x)$ can be obtained by looking up the results of $S(x) / M(x)$ using identical circuits.
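The grouping and partial-division steps above can be sketched in software for the binary field. This is a minimal illustration, not the hardware described in the paper: polynomials over GF(2) are stored as Python integers (bit $i$ holds the coefficient of $x^i$), the helper names `gf2_divmod` and `grouped_divmod` are hypothetical, and each partial division is done by ordinary schoolbook division rather than by a lookahead circuit.

```python
def gf2_divmod(a, m):
    """Schoolbook division over GF(2); bit i of an int is the coefficient of x^i."""
    quotient = 0
    deg_m = m.bit_length() - 1
    while a and a.bit_length() - 1 >= deg_m:
        shift = a.bit_length() - 1 - deg_m
        quotient ^= 1 << shift      # record the quotient term x^shift
        a ^= m << shift             # subtract (XOR) the shifted divisor
    return quotient, a


def grouped_divmod(p, m):
    """Divide p by m with one partial division per group of k+1 coefficients."""
    k = m.bit_length() - 1
    n = p.bit_length() - 1
    if n < k:                                   # nothing to divide
        return 0, p
    q_max = (n - k) // (k + 1)                  # largest q with n >= q(k+1) + k
    p_last = p & ((1 << k) - 1)                 # P_{-1}(x): degrees 0 .. k-1
    group_mask = ((1 << (k + 1)) - 1) << k      # S(x) format: degrees k .. 2k
    quotient, rem = 0, 0
    for j in range(q_max, -1, -1):
        p_j = (p >> (j * (k + 1))) & group_mask     # group P_j(x)
        p_prime = (rem << (k + 1)) ^ p_j            # P'_j(x); rem is 0 for j = q
        q_j, rem = gf2_divmod(p_prime, m)           # partial division
        quotient ^= q_j << (j * (k + 1))            # weighted sum of partial-quotients
    return quotient, rem ^ p_last                   # R_0(x) + P_{-1}(x)


M = 0b1010111                                   # x^6 + x^4 + x^2 + x + 1
P = 0b1101_0011_1010_1100_0111_0101             # arbitrary example dividend
assert grouped_divmod(P, M) == gf2_divmod(P, M)
```

The final assertion checks that the group-wise result agrees with a direct division of the whole dividend.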
To obtain the result of $S(x) / M(x)$ with less complex circuits and in a systematic way, we exploit the linearity of finite field arithmetic [1]. That is, $S(x) / M(x)$ is the same as the linear sum of each element of $S(x)$ divided by $M(x)$. As a simple example in the binary field, if the divisor polynomial is $M(x) = x^6 + x^4 + x^2 + x + 1$, then $S(x)$ can be expressed as follows:

$$S(x) = \Bigl(\sum_{i=0}^{6} s_i x^i\Bigr) x^6.$$

By exploiting the linearity of finite field arithmetic, all the information necessary to form the lookahead circuits can be listed as shown in Table I. By superposing the results in Table I, we obtain an optimized lookahead circuit for the partial-remainder and partial-quotient, as shown in Figure 2. Finally, since our algorithm based on LAPR does not need the partial-quotients to proceed with the division, the partial-quotient lookahead circuitry can be eliminated completely unless the target application needs the quotient explicitly.
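The linearity argument can be illustrated with a short sketch for the example divisor $M(x) = x^6 + x^4 + x^2 + x + 1$. It assumes the same integer encoding of GF(2) polynomials (bit $i$ is the coefficient of $x^i$); the names `TABLE` and `lookahead_divmod` are hypothetical. The table holds one quotient/remainder pair per basis term $x^{i+6}$, which is the kind of information a lookahead table would list, though the paper's Table I itself is not reproduced here.

```python
M = 0b1010111                 # x^6 + x^4 + x^2 + x + 1
K = M.bit_length() - 1        # degree k = 6


def gf2_divmod(a, m):
    """Schoolbook division over GF(2); bit i of an int is the coefficient of x^i."""
    quotient = 0
    deg_m = m.bit_length() - 1
    while a and a.bit_length() - 1 >= deg_m:
        shift = a.bit_length() - 1 - deg_m
        quotient ^= 1 << shift
        a ^= m << shift
    return quotient, a


# One (quotient, remainder) pair for each basis term x^(i+K), i = 0 .. K.
TABLE = [gf2_divmod(1 << (i + K), M) for i in range(K + 1)]


def lookahead_divmod(s_bits):
    """Return (Q, R) of S(x)/M(x), where bit i of s_bits is the coefficient s_i."""
    quotient = remainder = 0
    for i in range(K + 1):
        if (s_bits >> i) & 1:             # superpose the rows selected by s_i = 1
            quotient ^= TABLE[i][0]
            remainder ^= TABLE[i][1]
    return quotient, remainder


# Superposition agrees with direct division for every possible group value.
for s in range(1 << (K + 1)):
    assert lookahead_divmod(s) == gf2_divmod(s << K, M)
```

In hardware, each output bit of the superposition reduces to an XOR of the $s_i$ inputs whose table row has that bit set, which is why the lookahead logic is purely combinational.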
III. Division Architecture based on LAPR
Noting the inherent regularity and feedforward nature of our algorithm, we make it fully pipelined. Figure 3 shows the block diagram of the pipelined architecture based on LAPR. Here, the block FIRST is the register for the first group $P_q(x)$. Each group in the dividend polynomial is inserted one by one, sequentially, into its own specific stage, from the first to the last. Each group in the next dividend polynomial can be inserted as soon as the group of the present dividend polynomial at that stage has been processed. After $(q+2)$ cycles, all the blocks in Figure 3 operate simultaneously, so the throughput of this pipelined architecture is one remainder and one quotient per clock cycle.

An area-efficient sequential architecture based on LAPR is shown in Figure 5. It uses a single cell recursively to perform the division process based on LAPR. Every $(q+2)$ cycles, one remainder and one quotient are produced. Although this is slower than the pipelined architecture shown in Figure 3, as far as the authors know, it is still faster than any other division architecture reported to date.
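The throughput difference between the two architectures can be made concrete with a small cycle-count sketch. It assumes, as a simplification not stated in the text, that the pipelined design delivers its first result exactly $(q+2)$ cycles after the first group enters and one further result on every following clock; the function names are hypothetical.

```python
def pipelined_cycles(num_dividends, q):
    """Cycles to finish num_dividends divisions: (q+2) fill cycles, then one result per clock."""
    if num_dividends <= 0:
        return 0
    return (q + 2) + (num_dividends - 1)


def sequential_cycles(num_dividends, q):
    """Cycles for the single-cell design: one result every (q+2) cycles."""
    if num_dividends <= 0:
        return 0
    return num_dividends * (q + 2)


# Example with q = 3 (five groups per dividend) and 1000 dividends.
assert pipelined_cycles(1000, 3) == 1004
assert sequential_cycles(1000, 3) == 5000
```

For long streams of dividends, the ratio of the two counts approaches $(q+2)$, matching the statement that the sequential design produces one result every $(q+2)$ cycles while the pipelined design produces one per clock.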
IV. Experimental Verification and Performance Comparisons
To show the superiority of the proposed architecture based on LAPR over the conventional one using LFSR in terms of speed, area, and power consumption, we designed several widely used BCH/RS coding applications in the COMPASS ASIC development environment using 0.8 µm double-metal CMOS technology. Three applications, 1) a (32,28) RS encoder, 2) a (63,51) BCH encoder, and 3) a syndrome generator for a (63,51) BCH decoder, were designed as benchmark circuits to verify the performance of the proposed division architecture relative to the conventional LFSR-based one. The (32,28) RS code in $GF(2^m)$ and the (63,51) BCH code are currently used in CD (Compact Disk) error correction coding [6] and in AMPS (Advanced Mobile Phone Service) cellular phones, respectively. The chip microphotograph is shown in Figure 6.
Figure 6 Photo micrograph of the fabricated chips
The experimental results are summarized in Table II. The clock frequency used to obtain the same throughput (500K div/sec) is shown in the second column. Power consumption, measured at a supply voltage of 5 V, is listed in the 4th column. It indicates that the pipelined architectures based on LAPR show 17, 28, and 29 times improvement in power consumption compared with those using LFSR. The corresponding improvements for the sequential architectures based on LAPR are 10, 13, and 18 times, respectively. To show the power reduction that can be obtained by architecture-driven voltage scaling, we measured power consumption at the minimum supply voltage at which the circuits operate properly. Reducing the supply voltage comes at the cost of increased gate delays, so a loss of functional throughput is inevitable for circuits that use higher clock speeds. The 5th column in Table II shows this minimum power consumption. It indicates that further power reduction can be obtained by voltage scaling: the pipelined architectures based on LAPR show 32, 65, and 67 times improvement in power consumption compared with those using LFSR. The corresponding improvements for the sequential architectures based on LAPR are 14, 22, and 28 times, respectively. To show the power efficiency in terms of energy, the normalized power-delay product is depicted in Figure 7. All the circuits operate at a 5 V supply voltage and a 10 MHz clock frequency. It indicates that the pipelined and sequential architectures based on LAPR have a very small power-delay product compared with the conventional one using LFSR. We can also see that, at an identical clock frequency, the fully pipelined LAPR architecture delivers an orders-of-magnitude increase in speed for very little additional power.
Figure 7 Normalized power-delay product
V. Conclusion
We proposed long polynomial division architectures based on the LAPR division algorithm. Together, partial division on a group basis and the lookahead technique exploiting the linearity of finite field arithmetic enable the complete elimination of polynomial multiplication, leading to greatly increased throughput. Experimental verification on three benchmark circuits shows that, at identical throughput, the pipelined architecture based on LAPR consumes about 32, 65, and 67 times less power than the conventional one using LFSR. Since the proposed division algorithm based on LAPR is efficient, regular, and easily expandable, it can be used directly in VLSI implementations of various ECC applications where high speed and/or low power is required, such as communications, optical disks, portable equipment, and computer systems.
