We propose a pipelined division architecture for low-power ECC applications, based on partial division on a group basis and a lookahead technique that exploits the linearity of finite field arithmetic. The throughput is one division per clock cycle regardless of the degree of the dividend polynomial. The salient feature of this architecture is that it leads to a very low power-delay product. To verify the relative performance of the proposed division architecture over the conventional one using an LFSR, three RS and BCH code applications were fabricated using 0.8 µm double-metal CMOS technology.
I. Introduction
Division in the finite field $GF(2^m)$ is the most important building block in ECC (Error Correction Coding) systems such as BCH (Bose-Chaudhuri-Hocquenghem) and RS (Reed-Solomon) codes, since these block codes are based on long polynomial divisions [1]. The conventional Euclidean division architecture in a finite field uses an LFSR (Linear Feedback Shift Register). However, as the high-speed requirement for real-time audio/video coding and the low-power requirement for portable applications increase, this serial architecture has shown several limitations:
1) The throughput is limited by the degree of the dividend polynomial.
2) The presence of a global feedback signal imposes severe constraints on the switching speed and necessitates the use of a global clock [2].
3) This feedback signal limits the degree of parallelism that can be exploited for low power consumption [3].
4) The complete LFSR and the serial buffer registers must be clocked on every clock cycle regardless of whether their contents change, so useless power consumption cannot be avoided [4].
Therefore, high-speed, low-power ECC applications need a new division architecture that does not suffer from the limitations mentioned above.
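To make limitation 1) concrete, the following is a minimal behavioural sketch of bit-serial LFSR division over GF(2). The function name `lfsr_remainder` and the bit-list input format are assumptions made for illustration, not taken from the paper; the divisor used in the example is the one that appears later in Table I.

```python
def lfsr_remainder(dividend_bits, m, k):
    """Bit-serial remainder of P(x) mod M(x) over GF(2).

    dividend_bits: coefficients of P(x), highest degree first (0 or 1 each).
    m: divisor M(x) as an int, bit i holding the coefficient of x^i (bit k set).
    k: degree of M(x), i.e. the LFSR length.
    Returns (remainder, clock_cycles). Hypothetical helper for illustration.
    """
    mask = (1 << k) - 1
    reg = 0          # LFSR contents, all flip-flops cleared
    cycles = 0
    for bit in dividend_bits:
        feedback = (reg >> (k - 1)) & 1   # global feedback: bit shifted out this cycle
        reg = ((reg << 1) | bit) & mask   # shift in one dividend coefficient
        if feedback:
            reg ^= m & mask               # x^k is replaced by the low terms of M(x)
        cycles += 1                       # every register is clocked on every cycle
    return reg, cycles


# x^12 divided by M(x) = x^6 + x^4 + x^2 + x + 1
remainder, cycles = lfsr_remainder([1] + [0] * 12, 0b1010111, 6)
print(f"{remainder:06b}", cycles)
```

The loop runs once per dividend coefficient, so the cycle count grows with the degree of the dividend, and the `feedback` bit must reach every tap before the next shift can start.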
Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. © 1997 ACM 0-89791-903-3/97/08..$3.50
New Division Algorithm based on LAPR
While the conventional division algorithm processes serially on a symbol or bit basis, our division algorithm processes in parallel on a group basis. It starts from the definition of $P(x)$ as the long arbitrary dividend polynomial of degree $n$ and $M(x)$ as the fixed divisor polynomial of degree $k$. For one simple example in the binary field, if the divisor is $M(x) = x^6 + x^4 + x^2 + x + 1$, the input $S(x)$ of the lookahead circuit can be expressed as follows:

$$S(x) = \left( \sum_{i=0}^{6} s_i x^i \right) x^6$$

By exploiting the linearity of the finite field arithmetic, all the necessary information to form the lookahead circuits can be listed as shown in Table I.

Table I Division table for lookahead circuit: $M(x) = x^6 + x^4 + x^2 + x + 1$

Input S(x)/x^6   Partial-quotient   Partial-remainder
1000000          1010010            011110
0100000          0101001            001111
0010000          0010100            101100
0001000          0001010            010110
0000100          0000101            001011
0000010          0000010            101110
0000001          0000001            010111
By superposing the results in Table I, we can obtain an optimized lookahead circuit for the partial-remainder and the partial-quotient, as shown in Figure 2. One last thing to notice is that, since our algorithm based on LAPR does not need the partial-quotients to carry on the division process, the partial-quotient lookahead circuitry can be eliminated completely unless the target application needs the quotient explicitly.
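The table and the superposition step can be checked with a short sketch. It stores GF(2) polynomials as Python integers (bit i is the coefficient of x^i); the names `gf2_divmod`, `TABLE` and `lookahead` are hypothetical helpers introduced here, not part of the paper.

```python
M = 0b1010111  # M(x) = x^6 + x^4 + x^2 + x + 1, the divisor of Table I
K = 6          # degree of M(x)


def gf2_divmod(a, m):
    """Long division of a(x) by m(x) over GF(2); returns (quotient, remainder)."""
    q = 0
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        shift = a.bit_length() - 1 - dm
        q ^= 1 << shift      # record the quotient term x^shift
        a ^= m << shift      # subtract (XOR) the shifted divisor
    return q, a


# One row per single-bit input s_i = 1: divide x^i * x^6 by M(x).
TABLE = {i: gf2_divmod(1 << (i + K), M) for i in range(K, -1, -1)}


def lookahead(s):
    """Partial quotient and remainder of s(x) * x^6, obtained by XOR-ing
    (superposing) the table rows selected by the set bits of s."""
    q = r = 0
    for i in range(K + 1):
        if (s >> i) & 1:
            q ^= TABLE[i][0]
            r ^= TABLE[i][1]
    return q, r


# Print the rows in the order of Table I: input, partial-quotient, partial-remainder.
for i in range(K, -1, -1):
    quotient, remainder = TABLE[i]
    print(f"{1 << i:07b} {quotient:07b} {remainder:06b}")
```

Because division by a fixed $M(x)$ is linear over GF(2), `lookahead(s)` agrees with a direct division of $s(x) x^6$ for every 7-bit input, which is why each output bit of the circuit reduces to an XOR of selected input bits.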
Figure 2 a) Lookahead circuit for partial-remainder
Division Architecture based on LAPR
Noting the inherent regularity and feedforward nature of our algorithm, we make it fully pipelined. Figure 3 shows the block diagram of the pipelined architecture based on LAPR. Here, the block FIRST is the register for the first group $P_q(x)$. The $q$ identical blocks INT are intermediate group registers, which form new intermediate groups $P'_j(x)$ for $q-1 \ge j \ge 0$ by adding the partial-remainder from the previous group and the input group $P_j(x)$. The block LAST is the remainder register. Adding the partial-remainder from $P'_0(x)$ and $P_{-1}(x)$ forms the overall remainder. All of the group registers can be implemented using only FFs (flip-flops) and EXORs. There are $(q+1)$ identical blocks each of LOOK-AHEADR and LOOK-AHEADQ, which generate the partial-remainder and the partial-quotient respectively.
Figure 4 shows the operation diagram of the pipelined architecture. Each group in the dividend polynomial is inserted one by one sequentially to its own specific stage, from the first to the last. Each group in the next dividend polynomial can be inserted as soon as the group of the present dividend polynomial of that stage is processed. After $(q+2)$ cycles, all the blocks in Figure 3 [...]
The sequential architecture based on LAPR is shown in Figure 5. It uses a single cell recursively to perform the division process based on LAPR. Every $(q+2)$ cycles, one remainder and one quotient are produced. Although this is slower than the pipelined architecture shown in Figure 3, as far as the [...]
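The dataflow through FIRST, INT and LAST can be sketched in software as follows. This is a behavioural model only, under stated assumptions: each group is assumed to be $k = 6$ bits wide, the divisor is the one from Table I, the lookahead block is modelled by a direct modular reduction rather than by the XOR network, and the names `look_ahead_r`, `split_groups` and `lapr_remainder` are hypothetical. It does not model clocking or the overlap of successive dividends.

```python
M = 0b1010111  # divisor of Table I, bit i = coefficient of x^i
K = 6          # degree of M(x); also the ASSUMED group width in bits


def gf2_mod(a, m):
    """Remainder of a(x) divided by m(x) over GF(2)."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a


def look_ahead_r(group):
    """Partial remainder of group(x) * x^K; stands in for one LOOK-AHEADR block."""
    return gf2_mod(group << K, M)


def split_groups(p, q):
    """Split P(x) into q+2 groups of K bits, ordered P_q, ..., P_0, P_{-1}."""
    mask = (1 << K) - 1
    return [(p >> (K * (j + 1))) & mask for j in range(q, -2, -1)]


def lapr_remainder(p, q):
    """Remainder of P(x) mod M(x) computed group by group."""
    groups = split_groups(p, q)
    inter = groups[0]                     # FIRST holds P_q
    for g in groups[1:-1]:                # the q INT stages, j = q-1 down to 0
        inter = g ^ look_ahead_r(inter)   # P'_j = P_j + previous partial remainder
    return look_ahead_r(inter) ^ groups[-1]   # LAST adds P_{-1}


print(f"{lapr_remainder(1 << 12, 1):06b}")  # x^12 mod M(x)
```

Each stage depends only on the output of the stage before it, with no feedback path, which is what allows a register to be placed between stages so that a new dividend can enter while earlier ones are still in flight.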
Experimental Verification and Performance Comparisons
To show the superiority of the proposed architecture based on LAPR over the conventional one using an LFSR in terms of speed, area, and power consumption, we designed some widely used BCH/RS coding applications in the COMPASS ASIC development environment using 0.8 µm double-metal CMOS technology. Three applications were designed as benchmark circuits to verify the relative performance of the proposed division architecture over the conventional LFSR one: 1) a (32,28) RS encoder, 2) a (63,51) BCH encoder, and 3) a syndrome generator for a (63,51) BCH decoder. The (32,28) RS code in $GF(2^m)$ and the (63,51) BCH code are now being used in CD (Compact Disk) error correction coding [6] and the AMPS (Advanced Mobile Phone Service) cellular phone respectively. The chip microphotograph is shown in Figure 6.
Figure 6 Photo micrograph of the fabricated chips
The experimental results are summarized in Table II. The clock frequency used to obtain the same throughput (500K div/sec) is shown in the second column. Power consumption at a supply voltage of 5 V is measured and listed in the 4th column. It indicates that the pipelined architectures based on LAPR show improvements of 17, 28, and 29 times in power consumption compared with those using an LFSR. The corresponding improvements for the sequential architectures based on LAPR are 10, 13, and 18 times respectively. To show the power reduction that can be obtained by architecture-driven voltage scaling, we measured the power consumption at the minimum supply voltage at which the circuits still operate properly. Since reducing the supply voltage comes at the cost of increased gate delays, the higher the clock speed used, the more inevitable a loss of functional throughput becomes. The 5th column in Table II shows this minimum power consumption. It indicates that further power reduction can be obtained by voltage scaling. The pipelined architectures based on LAPR show improvements of 32, 65, and 67 times in power consumption compared with those using an LFSR. The corresponding improvements for the sequential architectures based on LAPR are 14, 22, and 28 times respectively. To show the power efficiency in terms of energy, the normalized power-delay product is depicted in Figure 7. All the circuits operate at a 5 V supply voltage and a 10 MHz clock frequency. It indicates that the pipelined and sequential architectures based on LAPR have a very small power-delay product compared with the conventional one using an LFSR. We can also see that, at identical clock frequency, the fully pipelined LAPR architecture produces an orders-of-magnitude boost in speed for very little power cost.
The fabricated circuits show that, at identical throughput, the pipelined architecture based on LAPR consumes about 32, 65, and 67 times less power than the conventional one using an LFSR.
Since the proposed division algorithm based on LAPR is efficient, regular, and easily expandable, it can be used directly in VLSI implementations of various ECC applications where high speed and/or low power is required, such as communication, optical disks, portable equipment, and computer systems.
