Abstract-With the spread of Reed-Solomon (RS) codes to portable wireless applications, low-power RS decoder design has become important. This paper discusses how the Berlekamp-Massey decoding algorithm can be modified and mapped to obtain a low-power architecture. In addition, architecture-level modifications that speed up the syndrome and error computations are proposed. The VLSI architecture and design of the proposed low-power/high-speed decoder are then presented. The proposed design is compared with a normal design that does not use these algorithm/architecture modifications, and the resulting power reduction is estimated. The results indicate a power reduction of about 40% or a speed-up of 1.34.
in xDSL (Digital Subscriber Line) services to protect the data from impulse noise [4]. RS codes are good candidates for use in wireless communication systems as part of a concatenated coding system, along with convolutional codes [5], [6].
An $(n, k)$ primitive RS code defined in the Galois field $GF(2^m)$ has codewords of length $n = 2^m - 1$, where $m$ is a positive integer and $k$ is the number of information symbols in the codeword. This RS code has minimum distance $d_{min} = n - k + 1$ and has $n - k$ redundant symbols. The generator polynomial of the code is $g(x) = \prod_{i=0}^{n-k-1} (x - \alpha^{b+i})$, where $\alpha$ is a primitive element in $GF(2^m)$ and $b$ is an integer constant. This implies that these $n - k$ consecutive powers of $\alpha$ are roots of every codeword polynomial. This property has been used to develop many efficient decoding algorithms for RS decoding.
Let denote the error vector. Note that if no error occurred at position . Also, denotes the actual value of the error introduced by the channel at position . Assume further that errors have occurred at positions . The symbols of the possibly corrupted word received from the channel can be written in terms of the codeword symbols and the error symbols as for . Then the decoding problem is to find the error values and the error locations . The received polynomial can be formed from the received symbols as . The error locator polynomial is defined as . The syndromes are defined as for and written in polynomial form as . The key equation can then be written as (1) where is the error magnitude polynomial with . The solution of this equation plays a pivotal role in the decoding process.
Below, we briefly review and summarize the various RS decoding algorithms. The RS decoding algorithms can be classified as shown in Fig. 1 . On the left side of Fig. 1 , all the algorithms that involve syndrome computation are shown. We will discuss these algorithms first. Once the syndrome has been computed, the next step is to solve the key equation to obtain the error locator polynomial. Peterson [7] proposed a solution to this problem which was later improved and extended by Gorenstein and Zierler [8] . Their approach was to write a set of equations involving the unknown error locator polynomial coefficients and the known syndromes, and then, solve the system of equations for the error locator using direct matrix inversion techniques. These matrix inversions become computationally inefficient for large . It can be mentioned that for small (for example in compact disc systems), closed form expressions for the error locations and error values in terms of the syndromes can be obtained. An efficient technique for finding the error locator was first proposed by Berlekamp [9] . Massey [10] interpreted this algorithm in terms of linear feedback shift register (LFSR) synthesis (this algorithm is referred to in the literature as the Berlekamp Massey algorithm). Sugiyama [11] , [12] recognized that the key equation could also be solved by applying Euclid's greatest common divisor (GCD) algorithm to obtain the error locator polynomial efficiently. Once the error locator polynomial has been obtained, the error values can be computed in the time or frequency domains. In the frequency domain, the error transform is found by recursively extending [13] , [12] the syndromes and the error vector is then found by applying an inverse Fourier transform (FT). In the time domain, the roots of the error-locator polynomial are found by a Chien search [14] . 
Then, the error values may be found either by solving the linear set of equations directly or by using a technique called Forney's method [15]. Forney's method also requires computation of . Berlekamp's algorithm can be used to compute by using a parallel set of iterations. can also be obtained from Euclid's algorithm by maintaining some additional information [12] during the iterative process. Alternatively, can be directly computed using the key equation after has been computed. Decoder implementations that use the syndrome are generally based on one of the above algorithms. Liu [16] proposed an RS decoder design that used Massey's shift register synthesis, followed by recursive extension of the error transform and, finally, an inverse FT to get the error vector. Shao et al. [17] and Demassieux et al. [18] based their decoder designs on Euclid's algorithm to find the error locator polynomial. Then the error transform was recursively extended and, finally, an inverse FT was performed to get the error values. Shao and Reed [19], Tong [20], Whitaker et al. [1], and Berlekamp et al. [21, Ch. 10] used Euclid's algorithm in their decoders to find the error locator and the error magnitude polynomials. Then, Chien search was used in conjunction with Forney's algorithm to calculate the error values. The implementation of Euclid's GCD algorithm was generally based on the systolic array proposed by Brent and Kung [22]. The computational complexity of the syndrome-based approach that uses either the Berlekamp algorithm or Euclid's method followed by the Chien search is . Another approach [23], [12], shown on the right side of Fig. 1, avoids the computation of the syndrome. Here, the algorithm (either the Berlekamp-Massey or the Berlekamp algorithm) is transformed so that all its variables are in the time domain. This is achieved by taking the inverse FT of all sequences in the original algorithm.
Hence, this algorithm is sometimes referred to as transform decoding without transforms. Shayan et al. [24] designed a versatile decoder based on this technique that can decode any RS code over . The decoder was based on a transformed version of the Berlekamp-Massey algorithm and time-domain recursive extension. The disadvantage is that the computational complexity is . Therefore, this technique can be used only for small blocklengths. Choomchuay and Arambepola [25] proposed an architecture based on a reorganized form of the computation in [23] that reduces the computational complexity. However, the storage requirements are . Hsu and Wang [26] proposed some modifications to [25] that reduced the storage requirements to . While the order of the computational complexity in [25] and [26] is reduced to , the actual complexity remains larger than for the syndrome-based Berlekamp algorithm.
In Section II, we summarize our approach to low-power/high-speed RS decoding. In Section III, we discuss how the Berlekamp algorithm can be modified to enable a low-power VLSI architecture/design. We first discuss modifications to the errors-only Berlekamp algorithm in Section III-A. We then extend these modifications to the errors-and-erasures Berlekamp algorithm in Section III-B. In Section IV, we discuss the VLSI architecture of an RS decoder including, specifically, architecture-level techniques for the syndrome and error computations. In addition, we compare architecture-level power estimates for the normal and modified designs. We also specify under what conditions power reduction can be expected. In Section V, we discuss the VLSI chip design of the normal and modified decoders. We also discuss the synthesis and layout results, which show that the power consumption can be reduced by 40%. Finally, in Section VI, we present concluding remarks.
II. THE ALGORITHM/ARCHITECTURE-BASED APPROACH
Until recently, the two key parameters in VLSI design were area and speed. Power consumption was a consideration only in order to reduce packaging and cooling costs for the chip. With the proliferation of portable devices, power consumption has become a primary design parameter, because it determines the battery lifetime in portable systems. Algorithm/architecture-level transformations can potentially have the greatest impact, because maximum design flexibility is available at the algorithm level. In addition, device- and circuit-level techniques can be applied independently of these algorithm/architecture-level techniques to obtain further power savings. We therefore focus on algorithm/architecture-level techniques to obtain low-power RS decoding. Various algorithm/architecture-level transformations [27]-[31] have been proposed in the literature. The transformations proposed include algebraic transformations [28] (such as associativity, distributivity, etc.), loop-unrolling/look-ahead transformations [32] that try to break the recursive loops, retiming [27], folding/unfolding [33] transformations, and strength reduction [34]. These transformations can be used to obtain area-efficient high-speed or low-power designs [31], [35]-[37], depending on the design goal. One of the approaches to reduce the power consumption is to use these transformations to expose parallelism and enable pipelining in the computation.
Over the last decade, we have seen continued scaling down of device feature size. This has improved the performance of VLSI designs in terms of speed. Also, algorithms of increased complexity can be implemented on a single chip. In addition, the area required to implement a given algorithm has shrunk dramatically because of the increased transistor density. It has been suggested in [30] that low-power operation can be obtained by modifying the algorithm to enable a parallel implementation. In other words, we utilize the increase in available transistor count cleverly to obtain power reduction. The idea can be explained briefly as follows. The power dissipation in a well-designed digital CMOS circuit can be modeled as [38] (2) where average probability that the total node capacitance is switched (also referred to as the activity factor); effective load capacitance; supply voltage; operating clock frequency. Similarly, the delay of the CMOS device can be approximated [30] as (3) where capacitance along the critical path; a device parameter; threshold voltage of the transistor. Fig. 2 shows the normal and modified designs. For the moment, let us assume that such a formulation is possible. We will show later how transformations can be applied to obtain such a formulation. The modified design can operate at a slower speed while maintaining throughput. If, for example, the critical path delays for the normal and modified designs are and , respectively, at a supply voltage of . Then, the supply voltage of the modified design can be reduced to to reduce power (where is chosen such that ). The ratio of power consumption in the modified design to the normal design can be written based on (2) as (4) where and are the effective capacitances of the normal and modified designs.
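The voltage-scaling argument can be illustrated numerically. The sketch below assumes the commonly used first-order models, switching power proportional to C V^2 f and gate delay proportional to V / (V - Vt)^2; the function names, the 3.3 V supply, the 0.7 V threshold, and the capacitance and frequency ratios are hypothetical values chosen only for illustration and are not taken from this paper.

```python
def relative_delay(v, vt):
    """First-order CMOS delay model, up to a constant factor: T ~ V / (V - Vt)^2."""
    return v / (v - vt) ** 2


def scaled_supply(v0, vt, slack):
    """Find the supply V <= v0 at which the delay is `slack` times the delay at v0.

    For V > Vt the modeled delay decreases monotonically with V, so bisection works.
    """
    target = slack * relative_delay(v0, vt)
    lo, hi = vt + 1e-9, v0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if relative_delay(mid, vt) > target:
            lo = mid   # too slow at this voltage: raise the supply
        else:
            hi = mid   # fast enough: try a lower supply
    return hi


def power_ratio(v0, vt, slack, cap_ratio, freq_ratio):
    """P_modified / P_normal under P ~ C * V^2 * f with equal activity factors."""
    v1 = scaled_supply(v0, vt, slack)
    return cap_ratio * (v1 / v0) ** 2 * freq_ratio


# Hypothetical scenario: the modified design may be 1.5x slower per clock,
# has 1.8x the effective capacitance, and runs at half the clock frequency.
V0, VT = 3.3, 0.7
v1 = scaled_supply(V0, VT, slack=1.5)
ratio = power_ratio(V0, VT, slack=1.5, cap_ratio=1.8, freq_ratio=0.5)
print(f"scaled supply = {v1:.2f} V, power ratio = {ratio:.2f}")
```

Under this model the saving comes from the quadratic dependence on the supply voltage: extra capacitance from the added parallel hardware is tolerable as long as the timing slack allows a sufficiently large voltage reduction.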
In this paper, we consider an RS decoding algorithm [39] that starts with the computation of the syndrome. Then, the error locator and error magnitude polynomials are computed using Berlekamp's algorithm. Finally, the error locations are found using the Chien search, and the error values are computed using Forney's algorithm. The syndrome computation involves evaluating the received polynomial at . The Chien search involves locating the errors by evaluating at and comparing the result with zero. Forney's algorithm computes the error values by evaluating and at the error locations. Architecture-level techniques can be used to create additional parallelism in the syndrome and error-value computations, as we will show later. The computation of the error locator and error magnitude polynomials is more involved, and creating additional parallelism is more difficult. We will discuss algorithm-level techniques that modify the Berlekamp algorithm to achieve low-power operation. It should be pointed out that the Berlekamp algorithm and Euclid's algorithm have the same computational complexity. We choose to use the Berlekamp algorithm because it possesses some properties that enable the application of a look-ahead-like transformation [32].
III. LOW-POWER BERLEKAMP ALGORITHM
As mentioned briefly in the previous section, the error-correction problem involves finding the locations of the errors and their corresponding values. In particular, we can correct errors using a code that has a minimum distance . The locations of erasures are known to the decoder. Correcting erasures, therefore, involves only finding the erasure values. In general, any pattern of erasures and errors can be corrected provided . The Berlekamp algorithm can be used to solve both problems.
We first consider techniques to modify the Berlekamp algorithm for low-power operation in the context of error correction, in Section III-A. Then, we extend our approach to solve the errors-and-erasures correction problem in Section III-B. We assume that the number of redundant symbols is given by so that we can correct up to error patterns.
A. Errors-Only Decoding
The original Berlekamp algorithm is reproduced here. When expressed in this form, we can observe some of its properties that will aid in the development of our algorithm modifications. Note that the Berlekamp algorithm (Algorithm 1, described below) computes both the error locator polynomial and the error magnitude polynomial in parallel (see steps 5a and 5b of Algorithm 1).
[Algorithm 1 listing: steps 5a and 5b refer to Table I.]
We want to look at updating to without going through . In this way, we want to halve the number of iterations. Of course, this modified iteration will be more complicated than a single iteration of the original algorithm. In general, the modified iteration takes time . In order to get an improved algorithm in terms of speed/power, it is enough if . In particular, the modified algorithm must expose additional parallelism so that the above holds. Ideally, we would like to have as close to as possible in order to maximize the improvement in speed or power. We observe that the update matrix for is the same as that for (see Steps 5a and 5b in Algorithm 1). This implies that any transformation that we derive for will apply to the polynomial pair . We note therefore that the critical variables to consider when modifying the Berlekamp algorithm are , , and . We make several observations about the variables involved in the Berlekamp algorithm that will enable the transformation to be performed efficiently. Note that only if and . This implies that can increase only once in any two iterations. We can prove this property by contradiction. Let us assume that . This implies that . Also, . Using in , we get . Since we cannot have and , we have obtained a contradiction that completes the proof. Also, observe that is updated only when increases. Otherwise, is just shifted. Note that the pair defines a linear feedback shift register [12] of minimum size that generates the sequence . In other words (5) for . Let be the discrepancy at iteration . In what follows, we refer to as the odd iteration and as the even iteration. We want to design the modified algorithm so that the critical variables at the end of the even iteration of the original algorithm match the variables after the th iteration of the modified algorithm. Define a new variable as the discrepancy predicted for the even iteration. Note from Algorithm 1 that if (since, in this case ). 
We will derive the modified algorithm by starting with and and considering the effect of two consecutive iterations of the original algorithm.
Various cases that need to be considered are shown in Fig. 3. Therefore, the final updates for cases I, II and III can be written as shown in Rows 1, 2, and 3, respectively, of Table II. For cases IV and V, so that . For both these cases, in the odd iteration we get . For case IV, since , for the odd iteration we get and . On the other hand, for case V, for the odd iteration we get and . We will use the idea of Massey's synthesis (as explained in [12]) to find an update from to for cases IV and V using the syndromes and variables that can be computed at . We first consider case IV. In this case, it is impossible for (as explained before). Therefore, and . The update equation for can be written as . The problem with this update is that in the modified algorithm there is no direct way to compute (since we want to avoid computing ). We approach the problem by writing the update in terms of and as (6) where is an unknown. We know from the original algorithm that can be written as , where is the last iteration for which . We will use the property that must represent the minimum length shift register for the sequence . This implies that . Based on this identity, we can solve for as follows:
where . This implies . The update for can be written as . Note that we need to maintain the variable from one iteration to the other in the modified algorithm. Observe that the term needs to be updated only at iteration for . We will discuss the details of this update after we discuss case V.
For case V, we have (the last inequality follows because both and are even (which can be interpreted as a 2-step predictor for the discrepancy) at each iteration of the modified algorithm.
We also note that in order to differentiate between the five cases as shown in Fig. 3 , we do not need to test both and (i.e., we do not need to check both and ). In cases I, II and III, and this makes the two tests equivalent. This is because (since is even, cannot equal ). Also, . In cases IV and V, we need to test only . This suggests that we can do away with the test and perform only the test . All of the above is summarized in Algorithm 2 and Table II. Notice that the number of iterations has been reduced to . We will exploit this fact to obtain a high-speed/low-power implementation.
The only difference between the Berlekamp-Massey synthesis and the Berlekamp algorithm is that the former computes only the error locator polynomial, while the latter computes the error locator and error magnitude polynomials in parallel (see steps 5a and 5b of Algorithm 1).
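For reference, the shift-register synthesis that underlies this discussion can be sketched in a few lines. The code below is the standard textbook Berlekamp-Massey iteration (one syndrome per iteration, error locator only), not the paper's Algorithm 1 or Algorithm 2. The field GF(2^4) with primitive polynomial x^4 + x + 1, the choice of syndromes S_1..S_4, and the error pattern are assumptions made for the example.

```python
# GF(2^4) built from the primitive polynomial x^4 + x + 1 (an illustrative choice).
M, PRIM = 4, 0b10011
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x >> M:
        x ^= PRIM
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inv(a):
    return EXP[N - LOG[a]]


def poly_eval(p, point):
    acc = 0
    for c in reversed(p):          # Horner's rule, highest coefficient first
        acc = mul(acc, point) ^ c
    return acc


def berlekamp_massey(S):
    """Shortest LFSR (connection polynomial C, length L) that generates S."""
    n = len(S)
    C = [1] + [0] * n              # current connection polynomial
    B = [1] + [0] * n              # copy of C saved at the last length change
    L, m, b = 0, 1, 1              # LFSR length, shift since last change, old discrepancy
    for r in range(n):
        d = S[r]                   # discrepancy between S[r] and the LFSR prediction
        for i in range(1, L + 1):
            d ^= mul(C[i], S[r - i])
        if d == 0:
            m += 1
            continue
        T = list(C)
        coef = mul(d, inv(b))
        for i in range(n + 1 - m):  # C(x) <- C(x) + (d / b) x^m B(x)
            C[i + m] ^= mul(coef, B[i])
        if 2 * L <= r:              # the register length must grow
            L, B, b, m = r + 1 - L, T, d, 1
        else:
            m += 1
    return C[:L + 1], L


# Hypothetical error pattern on the all-zero codeword: value 5 at position 3,
# value 9 at position 10. Syndromes S_1..S_4 are assumed (b = 1, 2t = 4).
errors = {3: 5, 10: 9}
S = [0] * 4
for j in range(1, 5):
    for pos, val in errors.items():
        S[j - 1] ^= mul(val, EXP[(pos * j) % N])
lam, L = berlekamp_massey(S)
```

With these inputs the returned polynomial has degree two and vanishes at the inverses of the two error locators, which is what the Chien search later tests for.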
Up to this point, we have discussed the computation of the error locator polynomial without referring to the error magnitude polynomial. In step 5b, the polynomial pair that yields the error magnitude polynomial is updated with the same matrix (see step 5b of Algorithm 1). Therefore, we can apply the modifications discussed above to step 5b of Algorithm 1 as well to obtain step 9b of Algorithm 2 (described below).
The modified Berlekamp algorithm can be compactly written as follows:
[Algorithm 2 listing: steps 9a, 9b, and 10 refer to Table II.]
B. Extension to Errors-and-Erasures Decoding
In this section, we further develop the algorithm modifications for situations in which we want to correct a combination of errors and erasures that are present in the received word. An erasure at a particular position in the received word indicates that the received symbol at that position is incorrect. Erasure information is usually available when possible errors are detected in other parts of the system (for example, by the demodulator or by the inner decoder in a concatenated coding scheme). In error correction, both the error values and error locations need to be found. In erasure correction, on the other hand, only the erasure values are to be found (erasure locations are known at the input to the decoder). Let us assume that there are errors and erasures in the received word. In general, errors and erasures can be corrected if . Let denote the received vector, the error values and the erasure values. Also, let denote the error locations and denote the erasure locations. Note that the (12) and (13) then similar to the proof in [12], can be updated as (14) with the matrix defined as in Table I. This means that if we can initialize at (15) and (16) then we can obtain at . We propose to use iterations to compute this initialization. If we assume that is pre-calculated along with the syndromes, then and are available at the start of the algorithm at . We initialize . For iterations , we define the update matrix so that and are unchanged. Also, we set . We notice that and that over . This means that we can compute one coefficient of in each iteration. In the th iteration, we can obtain by using the update matrix (Row 3 of Table III). Also, at the end of the th iteration, we set . This completes the initialization according to (15) and (16). The complete algorithm is given in Algorithm 3.
[Algorithm 3 (Berlekamp Algorithm, Errors and Erasures) listing: steps 6a, 6b, and 7 refer to Table III.]
In order that we be able to modify Algorithm 3 (in the same way that Algorithm 1 was modified to obtain Algorithm 2), we need to be even. This will allow us to reduce the first iterations in Algorithm 3 into iterations. We have originally assumed that the number of redundant symbols is even. Under this assumption we will show how to manipulate the erasures so that their number is always even (without changing the error and erasure correcting capability of the decoder). Let us consider a situation in which the number of errors and erasures are within the error-correction capability of the code (i.e.,
). If in addition, the number of erasures in a codeword is odd, then . If, for the sake of argument, we convert one of the erasures into an error, we still have (the codeword remains correctable). Alternatively, we can add one more erasure to the codeword. Again, we have . In both these cases, we have made the effective number of erasures even. Note that if the input codeword is such that , the above procedure We can detect such a case during the decoding process and flag it so that no changes are made to the codeword as it passes through the decoder. We note that for iterations in Algorithm 3, the modifications are the same as when we derived Algorithm 2. These modifications are shown in the first five rows of Table IV. In order to compute and in the first iterations, we define the update matrix so that . In addition, we recognize that we can obtain coefficients of the polynomial as and . In each iteration, we compute two coefficients of the polynomial . Additionally, which ensures that . Therefore, Row 6 of Table IV performs the correct update of the polynomials and . At iteration , we set and ensure that is initialized correctly for the iterations that compute the error locator polynomial (i.e., for ).
[Algorithm 4 (Low-Power Berlekamp Algorithm, Errors and Erasures) listing: steps 10a, 10b, 11, and 12 refer to Table II.]
We note that these modified algorithms can be used to decode shortened and punctured RS codes without any further change. This is because we can conceptually treat all the dropped symbols as erased positions [2] and use the same decoding algorithms. Also, the errors-and-erasures decoder algorithm can be used for encoding [2]. In this case, the message symbols along with erased symbols are fed into the decoder. The erasures are "corrected" by the decoder so that the output of the decoder is the symbol codeword corresponding to the message symbols.
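The parity argument used in this section to make the number of erasures even can be checked exhaustively for a small code. The sketch below assumes the usual correctability condition, twice the number of errors plus the number of erasures not exceeding the number of redundant symbols; the function names and the redundancy values 16 and 15 are hypothetical.

```python
def correctable(errors, erasures, redundancy):
    """Assumed condition: 2 * errors + erasures <= number of redundant symbols."""
    return 2 * errors + erasures <= redundancy


def pad_to_even(erasures):
    """Add one extra erasure when the erasure count is odd."""
    return erasures + (erasures % 2)


def padding_is_safe(redundancy):
    """True if padding never turns a correctable pattern into an uncorrectable one."""
    for errors in range(redundancy + 1):
        for erasures in range(redundancy + 1):
            if correctable(errors, erasures, redundancy) and not correctable(
                errors, pad_to_even(erasures), redundancy
            ):
                return False
    return True


print(padding_is_safe(16))  # even redundancy
print(padding_is_safe(15))  # odd redundancy
```

The check passes for even redundancy because an odd erasure count makes the left-hand side of the condition odd, leaving at least one unit of slack. With odd redundancy the slack can vanish, which is why the number of redundant symbols is assumed to be even.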
IV. VLSI ARCHITECTURE
In the previous section, we discussed algorithm-level transformations for the Berlekamp algorithm. A variety of architectures can be developed based on these algorithms. In this section, we discuss how this algorithm transformation translates into one specific VLSI architecture for the errors-only decoding problem. A similar architecture can also be developed for the errors-and-erasures case. In addition, we discuss architecture-level transformations for the syndrome and error computations in errors-only decoding.
A. Architecture for Syndrome Computation
The syndrome computation is given by (17) This equation can be computed using Horner's rule as (18) An architecture that computes is shown in Fig. 4 . Let us assume that the register is cleared before we start the computation and that the sequence is fed into the input in that order. At clock cycle , the previous accumulated value is multiplied with and added to . At the end of clock cycles, we obtain in the register. Using such cells, we can compute in parallel. In order to enable a parallel computation which can be completed in approximately clock cycles, we can rewrite (17) as (19) when is odd. The limits of both the sub-summations will change to when is even. Note from (19) that the first and second sub-summations consist of even and odd symbols, respectively. Both these sub-summations can be computed using Horner's rule (similar to (18) ). An architecture for this computation is shown in Fig. 5 . At the end of clocks (when is odd), we can obtain the two sub-summations. Finally, the odd sub-summation result is multiplied by and added to the even sub-summation. This means that the modified architecture can complete the computation in approximately half the number of clock cycles as the normal syndrome computation.
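The two syndrome schedules described above can be compared in software. The sketch below evaluates the received polynomial at a power of the primitive element in three ways: directly, by a single Horner recursion, and by two half-length Horner recursions over the even and odd symbols that are combined at the end. The field GF(2^4) with primitive polynomial x^4 + x + 1, the received word, and the function names are assumptions made for the example.

```python
# GF(2^4) built from the primitive polynomial x^4 + x + 1 (an illustrative choice).
M, PRIM = 4, 0b10011
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x >> M:
        x ^= PRIM
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def syndrome_direct(r, j):
    """Sum of r[i] * alpha^(i*j), used only as a reference."""
    acc = 0
    for i, sym in enumerate(r):
        acc ^= mul(sym, EXP[(i * j) % N])
    return acc


def syndrome_horner(r, j):
    """One multiply-accumulate per symbol, highest-index symbol fed first."""
    a = EXP[j % N]
    acc = 0
    for sym in reversed(r):
        acc = mul(acc, a) ^ sym
    return acc


def syndrome_split(r, j):
    """Two half-length recursions with multiplier alpha^(2j), then one combine step."""
    a = EXP[j % N]
    a2 = mul(a, a)
    even = odd = 0
    for sym in reversed(r[0::2]):   # symbols at even positions
        even = mul(even, a2) ^ sym
    for sym in reversed(r[1::2]):   # symbols at odd positions
        odd = mul(odd, a2) ^ sym
    return even ^ mul(a, odd)       # odd part is scaled by alpha^j before the final add


# Hypothetical received word of length 15.
received = [3, 0, 7, 1, 0, 0, 12, 5, 9, 0, 2, 0, 0, 15, 6]
syndromes = [syndrome_split(received, j) for j in range(4)]
```

In hardware the two half-length recursions run concurrently, so the split form needs roughly half as many clock cycles at the cost of a second multiply-accumulate cell and one extra multiplication at the end.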
B. Architecture for Error Computation
The error value computation based on Forney's method can be written as if if (20) Note that we need to compute and for in order to compute all the . In any field of characteristic 2 (21) and therefore, can be obtained as part of the computation for . This is because , where and are the polynomials formed by the even coefficients and the odd coefficients of , respectively. In addition, the circuit computes (this avoids the need for the initialization of in Algorithm 2). We can write the error computation in these terms as if if (22) An architecture for the evaluation of at is shown in Fig. 6 . Initially, the coefficients of the polynomial are loaded into the registers. After the initialization is complete, the multiplexer passes the value that is fed back. The value in the th register is multiplied by , so that the th register holds at the th clock cycle. The contents of the registers are added up to obtain for . A similar architecture can be used to evaluate . This implies that we can evaluate at the th cycle. In other words, in cycles we can compute all the required error values.
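The register-based evaluation described above, and the fact that the derivative term comes for free in characteristic 2, can be sketched as follows. Each register is multiplied by a fixed power of the primitive element on every clock; the sum of all registers gives the locator evaluated at successive powers, and the sum of the odd-indexed registers gives the evaluation point times the formal derivative. The field GF(2^4), the example locator, the evaluation order starting at power zero, and all names are assumptions made for the example.

```python
# GF(2^4) built from the primitive polynomial x^4 + x + 1 (an illustrative choice).
M, PRIM = 4, 0b10011
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x >> M:
        x ^= PRIM
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_eval(p, point):
    acc = 0
    for c in reversed(p):
        acc = mul(acc, point) ^ c
    return acc


def formal_derivative(p):
    """In characteristic 2 only the odd-degree terms survive differentiation."""
    return [c if i % 2 == 1 else 0 for i, c in enumerate(p)][1:]


def chien_cells(lam, clocks):
    """Return (Lambda(alpha^l), odd part of Lambda at alpha^l) for l = 0..clocks-1."""
    regs = list(lam)                 # register i is loaded with coefficient i
    out = []
    for _ in range(clocks):
        total = odd = 0
        for i, v in enumerate(regs):
            total ^= v
            if i % 2:
                odd ^= v
        out.append((total, odd))
        regs = [mul(v, EXP[i % N]) for i, v in enumerate(regs)]  # register i times alpha^i
    return out


# Hypothetical locator (1 + alpha^3 x)(1 + alpha^7 x).
a, b = EXP[3], EXP[7]
lam = [1, a ^ b, mul(a, b)]
trace = chien_cells(lam, N)
roots = [l for l, (total, _) in enumerate(trace) if total == 0]
```

Here `roots` lists the exponents at which the locator vanishes; for this example they are the exponents of the inverses of alpha^3 and alpha^7.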
If a parallel architecture is used to compute and , then all the required error values can be computed in cycles. This modified architecture is illustrated in Fig. 7 . In this case, the outputs of the registers are used to compute and the outputs of the registers compute for when is odd.
C. Architecture for the Berlekamp Algorithm
The original Berlekamp algorithm and its modified version have been discussed in detail in Section III. In this section, we discuss architectures for both Algorithm 1 and Algorithm 2. An architecture for performing the computation described in Algorithm 1 is shown in Fig. 8. The registers shown in the diagram hold the polynomials and . The exact manner in which they hold the various coefficients will become clear shortly. The register holds the algorithm iteration counter and mirrors the variable in Algorithm 1. At the start of the th iteration, register holds in positions , respectively. In the next clocks, are computed in sequence and shifted into position . The register holds the serial iteration counter that increments at every clock and synchronizes the various serial operations. The counter has a range from (i.e., is a modulo counter) and is incremented when . is accumulated serially at the same time that is serially updated to . In particular, the partially accumulated result is available when the serial iteration counter has the value . The contents and updates of register are defined in order to enable the computation of in iteration when is updated to . Register is initialized so that it holds in positions , respectively. For , the positions are right rotated by one position each time and the positions are kept unchanged, while is computed. When , positions of register are left rotated by one position so that holds .
At the beginning of iteration and are available. Note that is initialized to . These variables are used to compute the variables in the update matrix as well as other decision variables such as and . Since the update matrix contains in its second column, we conclude that a shifted version of (i.e., ) is also required. In order to enable an update for in a serial manner, a new position is introduced. Register holds in positions the values at the start of iteration . The results are shifted into register at position . Depending on whether the shifted version of is required or not, the contents of either or are used in the update. and are updated in a similar manner since the update matrix is the same as that for and . A similar architecture can be developed corresponding to the modified algorithm (i.e., Algorithm 2). This architecture is shown in Fig. 9. As before, the inner products for and are computed as is being updated to . The algorithm iteration counter in this case counts from . In this case, the serial iteration counter increments from 0 to (i.e., is a modulo counter). We note the presence of in the first column of the update matrix . This indicates that a shifted version of is required. This means that we need to add one position to the register (just as we did for the register in the architecture for the original algorithm). The second column of contains both and . This suggests that we need two shifted versions of (i.e., and ). We add two positions and to the register. In particular, at the start of the th iteration the register holds in positions respectively. As the serial iteration counter increments from 0 to are computed in sequence. The register at the start of th iteration holds in positions respectively. and are updated in a similar manner, while increments from 0 to . and are accumulated, and while increments from 1 to . The register must be organized to provide the appropriate data for the and accumulators.
At the beginning of iteration 1, register holds in positions and in positions , respectively. For , positions are kept unchanged, while positions are rotated and positions are right shifted using the value out of . When or , positions are left rotated so that holds . When and are loaded with the values being shifted into and , respectively. This ensures appropriate operation of the architecture in Fig. 9 .
In summary, Fig. 8 describes an architecture for the original errors-only algorithm (i.e., Algorithm 1), while Fig. 9 describes an architecture for the modified errors-only algorithm (i.e., Algorithm 2).
D. Extension of Architecture to Handle Errors and Erasures
When erasures are present in addition to errors, the symbols of the received word that are erasures are flagged. We refer to these flags as erasure indicators . The binary variable indicates the presence or absence of an erasure (i.e., indicates the presence of an erasure and the absence of an erasure at location ).
We know from Section III-B that the erasure locator polynomial needs to be computed to provide the appropriate initialization for Algorithm 3. Fig. 10 shows an architecture that computes the erasure locator polynomial from the erasure indicators. This computation is performed in parallel with the computation of the syndrome (see Fig. 4). Note that the only modification required in the syndrome computation is that when is set to zero. The registers correspond to the coefficients of the erasure locator polynomial . Register is initialized to one and the other registers 's are set to zero. If the erasure locator coefficients need to be updated based on (23) Therefore, the updated coefficients are given by . Note that is generated and stored in the register shown on the right side of Fig. 10. Depending on whether or the coefficients are modified or left unmodified as indicated by the multiplexer circuitry. After clocks, we get the coefficients of the erasure locator polynomial . An erasure counter is also maintained to count the number of erasures in the codeword.
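The serial erasure locator update can be sketched in software as follows. Each clock presents one erasure indicator; when it is set, the running polynomial is multiplied by a degree-one factor built from the current location value, and an erasure counter is incremented. The field GF(2^4), the factor form (1 + alpha^i x), the convention that positions are presented in increasing order, and all names are assumptions made for the example, and the hardware in Fig. 10 may order the symbols differently.

```python
# GF(2^4) built from the primitive polynomial x^4 + x + 1 (an illustrative choice).
M, PRIM = 4, 0b10011
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x >> M:
        x ^= PRIM
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_eval(p, point):
    acc = 0
    for c in reversed(p):
        acc = mul(acc, point) ^ c
    return acc


def erasure_locator(indicators, max_erasures):
    """Return (coefficients, erasure count); assumes at most max_erasures flags are set."""
    gamma = [1] + [0] * max_erasures  # register 0 starts at one, the rest at zero
    power = 1                         # location register, holds alpha^i at clock i
    count = 0
    for flag in indicators:
        if flag:
            # Multiply by (1 + power * x); go downward so old values are used.
            for j in range(max_erasures, 0, -1):
                gamma[j] ^= mul(power, gamma[j - 1])
            count += 1
        power = mul(power, EXP[1])    # advance the location register by alpha
    return gamma, count


# Hypothetical erasure pattern: positions 2 and 5 of a length-15 word.
flags = [1 if i in (2, 5) else 0 for i in range(N)]
gamma, count = erasure_locator(flags, max_erasures=4)
```

The multiplexers in the hardware correspond to the `if flag` branch: the coefficient registers are rewritten only on clocks where an erasure is indicated, while the location register advances on every clock.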
An architecture for the modified erasure locator computation is shown in Fig. 11 . This allows the erasure locator computation to be completed in clocks so that this computation can be done in parallel with the modified syndrome computation. We assume that, again, register is initialized to one and the other registers 's are set to zero. Since we have two erasure indicators that are input at one time, we have four cases corresponding to the four binary combinations of and . Case A corresponds to the combination . In this case, the registers need not be changed. Case B corresponds to the combination and . In this case, we need to modify the registers based on (24). In case C, which corresponds to and , we need to update based on (25). For case C, the update for the coefficient is given by . In case D, which corresponds to , we need to update the registers based on (26). For case D, the update for the coefficient is given by . The architecture corresponding to this update is shown in Fig. 11 . As we mentioned earlier, we can obtain the coefficients of after clocks.
Algorithm 4 works only when is even. We explained in Section III-B how the number of erasures can be made even without loss in performance (within the code's error-correcting capability) as long as the is even. The computation of the erasure locator polynomial has to take care of this. We have to pre-modify some of the erasure indicators (in particular, and ). If, at the start of iteration , the erasure counter is odd and , then we set and . On the other hand, if the erasure counter is even and , then we set and . In all other cases, the original values of and are retained. This ensures that the erasure locator polynomial always has an even degree.
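The four-case update can be illustrated with a short software sketch. It is an assumption-laden model rather than the circuit of Fig. 11: the locator is taken to be a product of factors (1 + a x) over erased positions, so that when both indicators of a pair are set the registers are multiplied by (1 + a x)(1 + b x) = 1 + (a + b) x + a b x^2. The field GF(2^4), the pairing of positions 2k and 2k+1, the zero padding to an even length, and the names `gf_mul` and `erasure_locator_pairs` are all hypothetical choices for illustration; the pre-modification of indicators that forces an even degree is not modelled.

```python
# Illustrative field only (an assumption): GF(2^4), primitive polynomial x^4 + x + 1.
M = 4
PRIM_POLY = 0b10011


def gf_mul(a, b):
    """Multiply two GF(2^M) elements by shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def erasure_locator_pairs(indicators):
    """Consume two erasure indicators per clock (positions 2k and 2k+1)."""
    if len(indicators) % 2:
        raise ValueError("pad the indicator list to an even length")
    gamma = [1] + [0] * len(indicators)
    a = 1  # alpha^(2k)
    for k in range(0, len(indicators), 2):
        b = gf_mul(a, 2)  # alpha^(2k+1)
        e0, e1 = indicators[k], indicators[k + 1]
        if e0 and e1:      # both erased: multiply by 1 + (a + b) x + a*b x^2
            c1, c2 = a ^ b, gf_mul(a, b)
        elif e0:           # only the first erased: multiply by 1 + a x
            c1, c2 = a, 0
        elif e1:           # only the second erased: multiply by 1 + b x
            c1, c2 = b, 0
        else:              # neither erased: registers are held
            c1, c2 = 0, 0
        if c1 or c2:
            # Descending j keeps gamma[j - 1] and gamma[j - 2] at their old values.
            for j in range(len(gamma) - 1, 0, -1):
                term = gf_mul(c1, gamma[j - 1])
                if j >= 2:
                    term ^= gf_mul(c2, gamma[j - 2])
                gamma[j] ^= term
        a = gf_mul(b, 2)   # advance to alpha^(2k+2)
    return gamma[:sum(indicators) + 1]
```

Processing a pair per clock halves the number of clocks at the cost of one extra multiplier per coefficient in the both-erased case, which is the trade-off the modified architecture accepts.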
E. Architecture Level Power Estimates
In Sections IV-A through IV-C, we have described VLSI architectures for the normal and modified algorithms. The original syndrome computation takes approximately clocks, while the modified version takes only approximately . Similarly, the error computation takes clocks, while the modified version takes approximately clocks. The original Berlekamp architecture takes clocks, while the modified version takes . Let us assume that each of the modules is implemented as a state machine, with pipelining between the modules. This indicates that our algorithm and architectural modifications can be put into the framework shown in Fig. 2 . Fig. 12(a) and (b) show the pipeline stages and the number of clock cycles used by each module of the original architecture and modified architecture, respectively. We also observe that the architectures described in the previous subsections are independent of and the critical paths are independent of . We will now investigate under what conditions we can expect a low-power implementation.
Let us assume that the normal circuit can work at a maximum clock speed of and the modified one at a maximum clock speed of . We can operate the modified design in two modes: high speed and low power. In the high-speed mode, the modified design is operated at the same supply voltage as the normal design. The throughput will be times the throughput of the normal design. So a speedup can be expected as long as . In the low-power mode, the throughput is maintained constant and the voltage of the modified design is scaled down. This means that the clock rate of the low-power design can be slowed down to . If the critical paths of the normal and modified designs at the maximum voltage are and , respectively, then the voltage can be reduced to (at this voltage the critical path delay in the modified design becomes ). The ratio of capacitances can be estimated as the ratio of active areas of the designs. Therefore, the ratio of the power consumption in the modified design to the normal design can be written based on (2) as (27). We can obtain accurate estimates for and by actually synthesizing a VLSI layout of both these designs and performing SPICE simulations of the critical paths. We also note that at the same supply voltage, our modified design has a smaller latency when compared with the normal design. In particular, at the voltage , the latency of the modified design is times the latency of the normal design. This can be important in applications in which decoding delay is critical.
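The two operating modes can be illustrated with a small numerical sketch. It assumes the usual dynamic-power model, in which power is proportional to switched capacitance times the square of the supply voltage times the clock frequency; whether this matches the exact form of (2) and (27) is an assumption. The function names and every number below are hypothetical and are not the measured values of this design.

```python
def throughput_ratio(cycles_normal, cycles_modified, f_normal, f_modified):
    """High-speed mode: throughput of the modified design relative to the normal one.

    Each design needs a given number of clocks per codeword and runs at its
    own maximum clock rate, so throughput scales as clock rate / clocks.
    """
    return (cycles_normal / cycles_modified) * (f_modified / f_normal)


def power_ratio(cap_ratio, v_normal, v_scaled, clock_ratio):
    """Low-power mode: power of the modified design relative to the normal one.

    Assumes P is proportional to C * V^2 * f.
      cap_ratio   -- capacitance ratio, estimated as the ratio of active areas
      v_normal    -- supply voltage of the normal design
      v_scaled    -- reduced supply voltage of the modified design
      clock_ratio -- clock rate of the slowed modified design over the normal clock rate
    """
    return cap_ratio * (v_scaled / v_normal) ** 2 * clock_ratio


# Hypothetical numbers chosen only to exercise the formulas.
speedup = throughput_ratio(cycles_normal=200, cycles_modified=100,
                           f_normal=100e6, f_modified=80e6)
ratio = power_ratio(cap_ratio=2.0, v_normal=5.0, v_scaled=2.5, clock_ratio=1.0)
```

With these made-up inputs, a design with twice the capacitance still halves the power once the supply voltage is halved, which shows why the quadratic voltage term dominates the trade-off.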
Note that the two decoders shown in Fig. 2 generate the error word as output. This means that for both designs we need to maintain the received word in a buffer and finally add the error word to the received word to get the corrected word. Note that if we want exactly the same input and output data interface for both designs, then we need to generate the odd and even input sequences on-chip at the input of the modified decoder. Also, we need to multiplex the two error values into a single sequence at the output of the modified decoder. In particular, at the input, we need just two bit registers and an bit 2-to-1 multiplexer (working at a clock rate ) to direct the odd and even inputs into the correct registers. Similarly, at the output, we need two bit registers and an bit 2-to-1 multiplexer (working at a clock rate ) that chooses the appropriate register to direct to the output. We note that the additional circuit complexity required for this is very small. In our designs, we assume that the modified design accepts two inputs in parallel and generates two outputs in parallel. In the next section, we will describe some of the considerations involved in the VLSI design of the normal and modified architectures.
Based on the normal and modified architectures developed in the previous section, we designed two separate decoders for a RS code that can correct up to three errors. A binary representation for was chosen to minimize the complexity (in terms of the number of transistors) of the GF multiplier and GF inverter. Elements of were written in terms of a concatenation of two elements (see [21, Ch. 10]). Based on this representation, the multiplication over can be written in terms of multiplications over . The inverse over can again be written in terms of multiplications over and as well as inversions over (see [21, Ch. 10]). This allowed for an optimal choice of binary representation for that leads to low complexity multipliers and inverters over . 
Multipliers and inverters were constructed as combinational circuits [21, Ch. 10], [40] .
The complete algorithm was simulated in C, and test vectors were generated for the circuit. The architecture was then described in Verilog HDL and its functionality was verified. Next, a timing-driven synthesis was performed using Cadence's Synergy synthesis tool. A 0.8-μm CMOS standard cell library was used with the synthesis tool to obtain a flat gate-level netlist. The functionality of the netlist was again verified against the C simulations. Finally, Cadence's Silicon Ensemble place-and-route tool was used to produce a channel-less layout. Routing was performed using three levels of metal. A SPICE netlist of the critical path transistors was extracted along with routing capacitances.
The above design process was completed for the normal and modified designs. The layouts of the normal and modified designs are shown in Figs. 13 and 14 , respectively. The corresponding areas are mm mm and mm mm. Table V summarizes the results. Fig. 15 shows how the critical path delays vary with voltage. The critical path delays were extracted from SPICE simulations. At a supply voltage of V, the critical path delays of the normal and modified designs are and ns. The voltage corresponds to the supply voltage at which the modified design is slowed down to ns. From Fig. 15 , we obtain V for the modified design. This shows that the power consumption can be reduced by about 40%. Alternatively, a speed-up by a factor of 1.34 can be obtained. In particular, the modified design in the high-speed mode can support a throughput of up to 192 Msymbols/s. Note that if a lower throughput is required, then either slower serial GF operators can be implemented or the GF operators may be reused to do multiple operations [41] in both the normal and modified designs.
VI. CONCLUSION
In this paper, we have developed a modified Berlekamp algorithm that enables low-power/high-speed operation. We showed that similar modifications can be derived for both errors-only decoding and errors-and-erasures decoding. Our algorithm-level approaches expose additional parallelism that enables us to design a low-power RS decoder. Architecture-level approaches were proposed for the syndrome and error computations. An architecture was proposed for the Berlekamp algorithm that takes advantage of the algorithm-level transformations. The impact of the algorithm- and architecture-level approaches was evaluated by designing two decoders for a RS code: one based on the normal algorithm and the other based on the low-power algorithm. Results showed that a speedup of 1.34 or a power saving of about 40% can be obtained. This validates our claim that algorithm-level transformations, when intelligently applied, can have a strong impact on the power consumption of a design.
