Abstract: Interpolation-based algebraic soft-decision decoding (ASD) of Reed-Solomon (RS) codes can achieve significant coding gain with polynomial complexity. Among available ASD algorithms, the low-complexity Chase (LCC) algorithm can achieve a good performance-complexity tradeoff. In addition, the multiplicity of each interpolation point involved in this algorithm is one. These features make the LCC decoding very attractive for practical hardware implementation. In this paper, we present an efficient and high-speed VLSI architecture for the implementation of the LCC decoder. ASD algorithms have two major steps: interpolation and factorization. The high efficiency of the LCC interpolation architecture is achieved by employing a backward interpolation technique, which enables the sharing of intermediate interpolation results. We also show that the factorization step can be eliminated in the case of LCC decoding. From critical path and latency analysis, the LCC decoder can achieve a throughput of several gigabits per second in ASIC implementations. In addition, the LCC decoder requires less than three times the area of a hard-decision decoder that has the same throughput.
I. INTRODUCTION
Reed-Solomon (RS) codes are used as error-correcting codes in many applications, such as computer hard drives, wireless and optical communications, and deep-space probing. Currently, the hard-decision Berlekamp-Massey algorithm (BMA) [1] is employed in practical systems to decode RS codes, owing to the existence of very high-speed hardware implementations. However, the BMA can only correct errors up to half the minimum distance of the code. Extensive research has been carried out on soft-decision RS decoding algorithms, which can correct more errors by making use of the reliability information from the channel. Nevertheless, previous soft-decision decoding algorithms either achieve only limited coding gain or have very high complexity. Algebraic soft-decision (ASD) decoding algorithms [2], [3], [4], [5], [6], [7] for RS codes have been developed recently. By incorporating the probability information from the channel into the algebraic interpolation process developed by Sudan and Guruswami [8], [9], these algorithms can achieve significant coding gain with a complexity that is polynomial with respect to the codeword length.
ASD algorithms consist of three steps: multiplicity assignment, interpolation and factorization. The multiplicity assignment step affects the overall error-correcting performance and complexity of the ASD algorithm. For the purpose of practical implementation, simple multiplicity assignment schemes are preferred. The multiplicity assignment in the Koetter-Vardy (KV) algorithm [2] can be implemented by constant multiplications followed by the floor function, and those in the low-complexity Chase (LCC) [6] and bit-level generalized minimum distance decoding (BGMD) [7] algorithms can be implemented by comparators. Smaller multiplicities translate to lower complexity in the interpolation and factorization steps. On the other hand, smaller multiplicities do not always lead to inferior error-correcting performance. For example, with multiplicity one and eight test vectors, the LCC algorithm can achieve similar or higher coding gain than the KV algorithm with maximum multiplicity four for a (255, 239) RS code [10].
This work is supported by NSF under grants 0846331 and 0835782.
In this paper, we present a high-speed VLSI architecture for the implementation of the LCC decoder. For an (n, k) RS code, the LCC algorithm carries out test decoding on multiple vectors of n interpolation points with multiplicity one. The re-encoding and coordinate transformation techniques [11], [12] can be applied to exclude k points from the interpolation process. Nevertheless, if the interpolation is carried out on each test vector from scratch, the extra complexity of interpolating over multiple vectors may offset the savings brought by the small multiplicity. Fortunately, the test vectors in the LCC decoding share common points. The backward interpolation scheme [13] can be employed, such that the interpolation result of a vector can be derived from that of another by taking care of only the points that are different between the two vectors. The factorization can still be carried out directly on the interpolation output when the re-encoding and coordinate transformation are applied. In this case, although a hard-decision decoding post-processing step is required to recover the actual errors, the number of iterations that need to be carried out in the factorization can be substantially reduced. The factorization architecture can be further simplified when the multiplicity is one. However, it still accounts for a significant part of the overall area of the LCC decoder. Recently, it was discovered that the error locations and magnitudes can be computed directly from the interpolation output in the case of LCC decoding [14]. As a result, the factorization step and the key-equation solver in the post-processing hard-decision decoding can be eliminated. The details of the factorization-free LCC decoder employing the backward interpolation are presented in this paper. In addition, the complexities of the LCC and BMA decoders are compared.
The structure of this paper is as follows. Section II introduces the LCC decoding, and the re-encoding and coordinate transformation techniques. Section III presents the backward interpolation and corresponding architectures. How to derive the error locations and magnitudes without factorization is described in Section IV. Section V presents comparisons of the LCC and BMA decoders. Conclusions are provided in Section VI.
II. LCC DECODING, RE-ENCODING AND COORDINATE TRANSFORMATION
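Without loss of generality, RS codes constructed over GF($2^q$) are considered. In evaluation-map encoding, the codeword of an $(n, k)$ RS code is formed by evaluating the message polynomial $f(x)$, whose degree is less than $k$, at $n$ distinct elements $\alpha_0, \alpha_1, \ldots, \alpha_{n-1}$ of GF($2^q$).
According to this evaluation mapping encoding, the message polynomial can be recovered by interpolating over the points $(\alpha_0, f(\alpha_0)), \ldots, (\alpha_{n-1}, f(\alpha_{n-1}))$. However, the codeword might be corrupted during the transmission. Given the observation of the received symbol at the jth code position, the associated interpolation points can include $(\alpha_j, \omega)$ for any $\omega \in$ GF($2^q$). ASD algorithms put higher weight on the more reliable points during the interpolation in order to increase the probability that the correct message polynomial can be recovered.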
ASD algorithms consist of three steps: multiplicity assignment, interpolation and factorization. They are different in the multiplicity assignment step, and share the same interpolation and factorization steps. The multiplicity assignment decides the interpolation points and their multiplicities by making use of the reliability information from the channel. This step affects not only the error-correcting performance of the ASD algorithm, but also the complexity of the following two steps.
In the LCC decoding, there are $2^\eta$ ($\eta \in \mathbb{Z}^+$, $\eta < n-k$) test vectors of n interpolation points. Although each point has multiplicity one, the reliability information is incorporated in the decision of the interpolation points. Each of the $\eta$ most unreliable code positions is assigned two interpolation points: $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$, where $\beta_j$ is the hard decision of the jth received symbol and $\beta'_j$ differs from $\beta_j$ in only the least reliable bit. For the remaining $n-\eta$ code positions, only one interpolation point, $(\alpha_j, \beta_j)$, is assigned. The test vectors are formed by picking one interpolation point for each code position. Since there are two possible points for each unreliable code position, the total number of test vectors is $2^\eta$.
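The construction of the test vectors can be illustrated with a minimal Python sketch. The function name `lcc_test_vectors`, the toy symbol values, and the `flip_bit` list (the index of the least reliable bit in each symbol) are hypothetical and chosen only for illustration; a real decoder would derive them from the channel reliabilities.

```python
from itertools import product

def lcc_test_vectors(hard, flipped, unreliable):
    """Enumerate the 2^eta test vectors.

    hard[j]     : hard-decision symbol beta_j at code position j
    flipped[j]  : beta'_j, i.e. beta_j with its least reliable bit flipped
    unreliable  : the eta least reliable code positions
    """
    vectors = []
    for choice in product((0, 1), repeat=len(unreliable)):
        vec = list(hard)  # reliable positions always keep beta_j
        for pos, use_flipped in zip(unreliable, choice):
            if use_flipped:
                vec[pos] = flipped[pos]
        vectors.append(vec)
    return vectors

# Toy data (assumed values, not taken from the paper).
hard = [0x3A, 0x07, 0xF1, 0x22, 0x9C, 0x55]
flip_bit = [0, 3, 1, 7, 2, 5]          # hypothetical least reliable bit per symbol
flipped = [h ^ (1 << b) for h, b in zip(hard, flip_bit)]
unreliable = [1, 4, 5]                 # eta = 3
vectors = lcc_test_vectors(hard, flipped, unreliable)
print(len(vectors))                    # number of test vectors, 2^eta
```

Every test vector agrees with the hard-decision vector outside the unreliable positions, which is what makes the sharing of interpolation results between vectors possible.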
The multiplicity assignment in the LCC decoding can be implemented by comparators. In contrast, the interpolation and factorization steps are much more hardware-demanding. In addition, the interpolation and factorization need to be carried out on each test vector in the LCC decoding. The complexity of the LCC and other ASD algorithms can be reduced by applying the re-encoding and coordinate transformation techniques [11], [12]. The basic idea of the re-encoding is to first pick the k most reliable code positions in the received word $r$, and denote them by the set $R$. Then an erasure decoding is applied to the k symbols of $r$ with index in $R$ to derive another codeword $\phi$. Since $\bar{c} = c + \phi$ is also a codeword, the error vector, $e$, of the codeword $c$ can be found by decoding $\bar{r} = r + \phi = c + e + \phi = \bar{c} + e$ instead. The advantage of decoding $\bar{r}$ instead is that the k symbols in $\bar{r}$ with index in $R$ are zero. Accordingly, the interpolation over these points can be pre-solved as $\prod_{i \in R}(x + \alpha_i)$ and the expensive bivariate interpolation only needs to be carried out on the points in the remaining $n-k$ code positions. In addition, coordinate transformation can be applied to factor out $\prod_{i \in R}(x + \alpha_i)$, which is now a common term of all polynomials involved in the bivariate interpolation. As a result, the length of the polynomials and the memory requirement of the interpolation can also be reduced.
The factorization can still be applied directly to the interpolation output when the re-encoding and coordinate transformation are employed. In this case, the actual errors in the k most reliable code positions can be recovered after a hard-decision decoding, such as BMA, in which the factorization outputs are used as syndromes. If $\tau$ errors need to be corrected in the k most reliable code positions, $2\tau$ syndromes need to be computed from the factorization. The number of iterations in the factorization equals the number of symbols that need to be computed. Originally, k symbols need to be computed as the coefficients for each f(x) factor in the factorization. Since $2\tau$ can be set to a number that is much smaller than k, the complexity of the factorization can also be significantly reduced as a result of re-encoding and coordinate transformation, despite the extra hard-decision decoding. After the errors in the k most reliable code positions have been corrected, an erasure decoding can be applied to recover the transmitted codeword $c$. This decoding process is illustrated in Fig. 1 [14].
Besides the re-encoding and coordinate transformation, other techniques and architectures have been proposed to reduce the complexity of the interpolation [15], [16], [17], [18], [19], [20]. These architectures can be further simplified in the case of multiplicity one. However, in the LCC decoding, the interpolation needs to be carried out on each test vector. Starting the interpolation for each vector from scratch may offset the savings brought by the small multiplicity. Two interpolation algorithms can be employed for practical implementations: Nielson's algorithm [21], [22] and the Lee-O'Sullivan (LO) algorithm [23]. Although the LO algorithm has lower complexity when the maximum multiplicity is less than three [20], it does not allow interpolation points and their multiplicities to be changed once the interpolation has started. Hence, the point-by-point Nielson's algorithm is employed in our LCC decoder design in order to enable the sharing of intermediate interpolation results. A backward interpolation scheme for the LCC decoding has been proposed in [13] to eliminate points from given interpolation results. Employing this scheme, the interpolation over the second and later test vectors only needs to take care of the points that are different from the previous vector. Section III presents the details of this scheme. The factorization problem can be solved by using the iterative algorithm proposed by Roth and Ruckenstein [24]. When the y-degree of $Q(x, y)$ is larger than two, the bottleneck of this algorithm lies in the exhaustive-search-based root computation over finite fields required in each iteration. Several architectures have been proposed to increase the speed of the root computation and factorization [25], [26], [27], [28]. In the case of LCC decoding, the y-degree of $Q(x, y)$ is one. Accordingly, the root computation only needs to be done for degree-one polynomials. Hence it is no longer a bottleneck.
A selection technique has been proposed in [6] to pass the interpolation output of only one test vector to the factorization step at the cost of a small performance degradation. Nevertheless, the factorization architecture still accounts for a significant proportion of the overall decoder area. It was discovered in [14] that the factorization and the key equation solver in the following hard-decision decoder can actually be eliminated in the case of LCC decoding. This scheme will be detailed in Section IV.
III. BACKWARD INTERPOLATION
The test vectors in the LCC decoding can be ordered such that adjacent vectors differ in only one pair of points, and the differing points are of the form $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$. If $(\alpha_j, \beta_j)$ can be eliminated from the interpolation result of the current vector, then the interpolation result for the next vector can be derived by adding $(\alpha_j, \beta'_j)$ using Nielson's algorithm. In this case, the interpolation for the second and later test vectors only needs to take care of the differing points. Accordingly, significant computation reduction can be achieved. Eliminating points from a given interpolation result is referred to as the backward interpolation, while adding points using Nielson's algorithm is called the forward interpolation. The backward interpolation is built upon Nielson's algorithm. Hence, Nielson's algorithm (Algorithm A) is described first.
In Nielson's algorithm, the maximum y-degree $t$ of the candidate polynomials is decided by the total number of interpolation constraints and the lexicographical order of monomials according to the $(1, k-1)$-weighted degree. For high-rate codes, $t$ equals the maximum interpolation multiplicity. The interpolation constraints are satisfied one after another. In the iteration for constraint $(a, b)$ of point $(\alpha, \beta)$, the coefficient of $x^a y^b$ in $Q^{(l)}(x+\alpha, y+\beta)$ is computed for each candidate polynomial $Q^{(l)}(x, y)$. If this coefficient is zero, it means that the constraint $(a, b)$ of point $(\alpha, \beta)$ is already satisfied by the corresponding polynomial. Otherwise, the polynomials are updated in steps A3 and A4 to force these coefficients to zero. In addition, the polynomial updating does not affect the interpolation constraints that have already been satisfied in previous iterations. At the end of each iteration, the candidate polynomials form a Gröbner basis of the module consisting of polynomials with maximum y-degree $t$ that satisfy all previously covered interpolation constraints. Since the polynomial with minimum weighted degree in the Gröbner basis is the polynomial with minimum weighted degree in the module, the desired interpolation output can be found after iterations for all constraints are carried out. When the re-encoding and coordinate transformation are applied, the polynomials can be initialized in the same way. However, the $(1, -1)$-weighted degree should be used due to the coordinate transformation.
In the LCC decoding, the maximum interpolation multiplicity is one. Hence, there are only two polynomials involved in the interpolation and the maximum y-degree of these polynomials is one. In other words, the polynomials in the Gröbner basis can be expressed as $Q^{(l)}(x, y) = q_0^{(l)}(x) + q_1^{(l)}(x)\,y$ for $l = 0, 1$. In the forward iteration for a point $(\alpha, \beta)$, the minimum polynomial is multiplied by $(x + \alpha)$ in step A4. If a basis polynomial still contains this factor, dividing it by $(x + \alpha)$ removes the constraint of the point $(\alpha, \beta)$. In addition, the division does not affect other interpolation points. Accordingly, after the division, the polynomials form a Gröbner basis that passes all points except $(\alpha, \beta)$. Note that this Gröbner basis may not be the same as the one that would have been derived by carrying out the interpolation over all points except $(\alpha, \beta)$ using Nielson's algorithm. However, they are Gröbner bases of the same module. The polynomial with the minimum weighted degree must appear in any Gröbner basis of the module and is unique up to a non-zero constant scalar [29].
The minimum polynomial in the interpolation iteration for the point $(\alpha, \beta)$ may not be the minimum polynomial in later interpolation iterations. Hence it can be updated by linear combinations, and the factor $(x + \alpha)$ may be lost at the end of the interpolation for all points. Therefore, we first need to find out whether the polynomials in the final Gröbner basis have the factor $(x + \alpha)$. It can be derived that $Q^{(l)}(x, y)$ has the factor $(x + \alpha)$ iff $q_1^{(l)}(\alpha) = 0$. In addition, it has been proved that the basis polynomials cannot all have the factor $(x + \alpha)$ [10]. It is possible that none of the basis polynomials has the factor. In this case, let $u = \arg\min_l \{\mathrm{wdeg}(Q^{(l)}) \mid q_1^{(l)}(\alpha) \neq 0\}$ and $v = \{0, 1\} \setminus u$. An equivalent Gröbner basis that still passes all interpolation points can be constructed as
$$Q^{(u)}(x, y) \leftarrow Q^{(u)}(x, y), \qquad Q^{(v)}(x, y) \leftarrow q_1^{(u)}(\alpha)\,Q^{(v)}(x, y) + q_1^{(v)}(\alpha)\,Q^{(u)}(x, y). \tag{1}$$
It has been proved that the updated $Q^{(v)}(x, y)$ in (1) contains the factor $(x + \alpha)$. When there is one basis polynomial that contains $(x + \alpha)$, this factor can be divided out of the polynomial to form a Gröbner basis that passes all interpolation points except $(\alpha, \beta)$. In summary, the backward interpolation to eliminate a point $(\alpha, \beta)$ of multiplicity one from a given interpolation result can be described by the pseudocode in Algorithm B.
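Algorithm B: Backward Interpolation for the LCC Decoding
B1: compute $q_1^{(l)}(\alpha)$ for $l = 0, 1$
B2: $u = \arg\min_l \{\mathrm{wdeg}(Q^{(l)}) \mid q_1^{(l)}(\alpha) \neq 0\}$, $v = \{0, 1\} \setminus u$
B3: $Q^{(v)}(x, y) = q_1^{(u)}(\alpha)\,Q^{(v)}(x, y) + q_1^{(v)}(\alpha)\,Q^{(u)}(x, y)$
B4: $Q^{(v)}(x, y) = Q^{(v)}(x, y) / (x + \alpha)$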
In Algorithm B, the linear combination in step B3 is also applied when there is one basis polynomial that contains the factor $(x + \alpha)$ before the linear combination. In this case, $q_1^{(u)}(\alpha) \neq 0$ and $q_1^{(v)}(\alpha) = 0$. Hence the linear combination in step B3 reduces to a scalar multiplication, and leads to an equivalent Gröbner basis. The purpose of applying the linear combination in both cases is to reduce the complexity and latency of the control in hardware implementations. It can be observed that the computations in the backward interpolation are very similar to those in Nielson's forward interpolation. The univariate polynomial evaluation in step B1 can be computed as an intermediate result of the bivariate polynomial evaluation in step A1. In addition, the B3 and A3 steps are the same. Accordingly, computation units can be shared between these steps. The major difference between the backward and forward interpolation is that instead of multiplying by the factor $(x + \alpha)$ in the A4 step, this factor is divided out in the B4 step. The division is done in block B by passing $\alpha$ through the multiplexor to the multiplier. No computation needs to be carried out on the other polynomial. This can be achieved by passing '0' through the multiplexor in block C. The PU architecture is also pipelined to limit the critical path to one multiplier, one adder and one multiplexor.
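The interplay between the forward and backward steps can be checked with a small software model. The following Python sketch is an illustration under stated assumptions, not the hardware design: it works over GF($2^4$) with the primitive polynomial $x^4 + x + 1$, uses the $(1, k-1)$-weighted degree with a toy $k = 2$, omits re-encoding, and all function names and point values are hypothetical. The forward step is the multiplicity-one form of the point-by-point interpolation, and `backward` follows steps B1 to B4.

```python
# GF(2^4) log/antilog tables, primitive polynomial x^4 + x + 1 (assumed field).
EXP, LOG = [0] * 30, [0] * 16
e = 1
for i in range(15):
    EXP[i] = EXP[i + 15] = e
    LOG[e] = i
    e <<= 1
    if e & 0x10:
        e ^= 0x13

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ev(p, a):                       # Horner evaluation, coefficients low order first
    r = 0
    for c in reversed(p):
        r = mul(r, a) ^ c
    return r

def ev2(Q, a, b):                   # Q(a, b) = q0(a) + q1(a) * b
    return ev(Q[0], a) ^ mul(ev(Q[1], a), b)

def comb(s, p, t, q):               # s * p(x) + t * q(x)
    n = max(len(p), len(q))
    p, q = p + [0] * (n - len(p)), q + [0] * (n - len(q))
    return [mul(s, c) ^ mul(t, d) for c, d in zip(p, q)]

def times_x_plus_a(p, a):           # (x + a) * p(x)
    return [mul(a, c) ^ lo for c, lo in zip(p + [0], [0] + p)]

def div_x_plus_a(p, a):             # exact synthetic division by (x + a)
    q, carry = [0] * (len(p) - 1), 0
    for i in range(len(p) - 1, 0, -1):
        carry = p[i] ^ mul(carry, a)
        q[i - 1] = carry
    assert p[0] ^ mul(carry, a) == 0, "(x + a) must divide p(x)"
    return q or [0]

def deg(p):
    return max((i for i, c in enumerate(p) if c), default=float("-inf"))

def wdeg(Q, w):                     # (1, w)-weighted degree, y-term wins ties
    return max((deg(Q[0]), 0), (deg(Q[1]) + w, 1))

def forward(basis, pt, w):          # add one point of multiplicity one
    a, b = pt
    d = [ev2(Q, a, b) for Q in basis]
    nz = [l for l in (0, 1) if d[l]]
    if not nz:
        return basis
    u = min(nz, key=lambda l: wdeg(basis[l], w))
    out = list(basis)
    for v in nz:
        if v != u:                  # cancel the discrepancy of the other polynomial
            out[v] = [comb(d[u], basis[v][j], d[v], basis[u][j]) for j in (0, 1)]
    out[u] = [times_x_plus_a(basis[u][j], a) for j in (0, 1)]
    return out

def backward(basis, pt, w):         # eliminate one point (Algorithm B)
    a, _ = pt
    d = [ev(Q[1], a) for Q in basis]                                       # B1
    u = min((l for l in (0, 1) if d[l]), key=lambda l: wdeg(basis[l], w))  # B2
    v = 1 - u
    Qv = [comb(d[u], basis[v][j], d[v], basis[u][j]) for j in (0, 1)]      # B3
    out = list(basis)
    out[v] = [div_x_plus_a(Qv[j], a) for j in (0, 1)]                      # B4
    return out

K = 2                                # toy code dimension, weight w = K - 1
pts = [(1, 5), (2, 7), (3, 1), (4, 9), (5, 0), (6, 12)]   # distinct x coordinates
full = [[[1], [0]], [[0], [1]]]      # Q^(0) = 1, Q^(1) = y
for p in pts:
    full = forward(full, p, K - 1)
removed = pts[2]
remaining = [p for p in pts if p != removed]
after = backward(full, removed, K - 1)
print(all(ev2(Q, a, b) == 0 for Q in after for a, b in remaining))
```

The final line checks that both polynomials returned by `backward` still pass through every remaining point, which is the property the backward interpolation relies on.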
TABLE I. Iteration number and memory requirement of the forward-only (i), forward-only (ii), and backward-forward interpolation architectures.
In the LCC decoding, the forward interpolation can be used to derive the interpolation output for the first test vector. After that, the interpolation for each of the second and later test vectors takes only two iterations: one backward and one forward. During this process, only one interpolation result needs to be stored at any time. If only forward interpolation is employed, either (i) the interpolation over the $\eta$ unreliable points needs to start from scratch for each test vector, or (ii) intermediate interpolation results need to be stored after each unreliable point is added. Table I lists the iteration number and memory requirement comparisons of different interpolation architectures. Since most of the computation units required for the backward interpolation can be shared with those in the forward interpolation, the area overhead of incorporating the backward interpolation is very small. In addition, the same critical path can be achieved in all interpolation architectures. Hence, our backward-forward interpolation architecture can achieve either substantial speedup or area reduction compared to previous architectures. It has been reported in [10] that our LCC interpolation architecture for a (255, 239) code with $\eta = 3$ can achieve 48% higher efficiency in terms of speed/area ratio than the previous best design. In addition, it can be observed from Table I that the savings brought by our architecture become more significant as $\eta$ increases.
The backward interpolation has been extended to the case of iterative BGMD decoding with maximum multiplicity $m_{max} = 2$ [10]. In the BGMD decoding, depending on the number of unreliable bits in a symbol, a code position can have either one interpolation point, $(\alpha_j, \beta_j)$, of multiplicity $m_{max}$, two interpolation points, $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$, of multiplicity $m_{max}/2$, or no interpolation point. In addition, multiple decoding iterations using different thresholds for the bit-reliability decision can be carried out to achieve higher coding gain. It has been observed that the BGMD decoding with $m_{max} = 2$ and two decoding iterations can achieve similar or higher coding gain than the KV algorithm with maximum multiplicity four [10]. Using the bit-reliability thresholds in a different sequence does not affect the overall error-correcting performance of iterative BGMD decoding. If a lower threshold is used in the next decoding iteration, then the only case that cannot be handled by forward interpolation is that a code position has two points $(\alpha_j, \beta_j)$, $(\alpha_j, \beta'_j)$ with multiplicity one in the current decoding iteration, but has one point $(\alpha_j, \beta_j)$ with multiplicity two in the next decoding iteration. In this case, $(\alpha_j, \beta'_j)$ can be eliminated similarly from the Gröbner basis by dividing $(x + \alpha_j)$ out of a basis polynomial. However, the same factor has also been multiplied during the interpolation over $(\alpha_j, \beta_j)$. There are two polynomials in the Gröbner basis containing this factor. We cannot tell which polynomial was multiplied by this factor during the interpolation over $(\alpha_j, \beta'_j)$ until the quotient is computed and evaluated over $(\alpha_j, \beta'_j)$.
In order to reduce the latency in hardware implementation, both $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$ are eliminated simultaneously by dividing $(x + \alpha_j)$ out of two polynomials in the basis. Then the multiplicity of $(\alpha_j, \beta_j)$ is increased from zero to two by using Nielson's forward interpolation. Employing this scheme, the interpolation result for the next decoding iteration can be derived from that of the current iteration by taking care of only the different interpolation points. As a result, significant speedup can be achieved compared to starting the interpolation for the next decoding iteration from scratch. Similar ideas can be extended to eliminate all points of multiplicity one from a given interpolation result, if there is no other point with the same $\alpha$ coordinate.
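One ordering in which adjacent LCC test vectors differ in exactly one pair of points is a binary-reflected Gray code over the $\eta$ unreliable positions. The sketch below assumes that ordering (the source does not name a specific one) and counts interpolation iterations for the (255, 239) code with $\eta = 3$ after re-encoding. The backward-forward count follows the description in this section; the forward-only (i) count assumes the $n-k-\eta$ common points are interpolated once and the $\eta$ unreliable points are redone for every vector. These counts are derived from the text, not copied from Table I, and the function names are hypothetical.

```python
def gray_order(eta):
    """Binary-reflected Gray code: bit i selects beta or beta' at unreliable position i."""
    return [i ^ (i >> 1) for i in range(1 << eta)]

def iteration_counts(n, k, eta):
    pts = n - k                                    # points left after re-encoding
    # First vector by forward interpolation, then one backward plus one forward
    # iteration for each remaining vector.
    backward_forward = pts + 2 * ((1 << eta) - 1)
    # Forward-only (i): common points once, unreliable points from scratch per vector.
    forward_scratch = (pts - eta) + eta * (1 << eta)
    return backward_forward, forward_scratch

order = gray_order(3)
# Number of unreliable positions in which adjacent test vectors differ.
diffs = [bin(a ^ b).count("1") for a, b in zip(order, order[1:])]
print(order, diffs, iteration_counts(255, 239, 3))
```

Under these assumptions the gap between the two counts widens quickly with $\eta$, because the forward-only count grows as $\eta 2^\eta$ while the backward-forward count grows as $2^{\eta+1}$.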
IV. ELIMINATING THE FACTORIZATION
Applying the re-encoding and coordinate transformation, $\bar{r} = \bar{c} + e$ is zero in each of the k reliable code positions. Therefore, $\bar{c}_i = e_i$ for $i \in R$. Assume that the message polynomial corresponding to $\bar{c}$ is $\bar{f}(x)$. Then $\bar{f}(\alpha_i) = \bar{c}_i = e_i$ for $i \in R$. In addition, $e_i = 0$ if there is no error in the ith position. In this case, $(x + \alpha_i)$ is a factor of $\bar{f}(x)$. Accordingly,
$$\bar{f}(x) = \delta(x) \prod_{i \in R,\, e_i = 0} (x + \alpha_i),$$
where $\delta(x)$ is a polynomial that does not have any root $\alpha_i$ for $i \in R$ and $e_i \neq 0$. As mentioned previously, the factorization can be applied directly to the interpolation output, $Q(x, y)$, after the re-encoding and coordinate transformation have been applied. Since $\prod_{i \in R}(x + \alpha_i)$ has been divided out during the coordinate transformation, the factorization will yield $\gamma(x)$ such that $y + \gamma(x)$ is a factor of $Q(x, y)$. In addition, it can be derived that
$$\gamma(x) = \frac{\bar{f}(x)}{\prod_{i \in R}(x + \alpha_i)} = \frac{\delta(x)}{\prod_{i \in R,\, e_i \neq 0}(x + \alpha_i)}. \tag{2}$$
Accordingly, the coefficients of $\gamma(x)$ can be used as syndromes in hard-decision decoding, such as BMA, to recover the errors in the k most reliable code positions. This further explains the decoding process illustrated in Fig. 1.
In the case of LCC decoding, the y-degree of $Q(x, y)$ is one. Hence $Q(x, y)$ can be written in the form of $Q(x, y) = q_0(x) + q_1(x)\,y$. Therefore, $\gamma(x)$ can also be expressed as
$$\gamma(x) = \frac{q_0(x)}{q_1(x)}. \tag{3}$$
Since $\delta(x)$ does not have any root $\alpha_i$ for $i \in R$ and $e_i \neq 0$, by comparing (2) and (3), it can be derived that
$$q_1(x) = p(x) \prod_{i \in R,\, e_i \neq 0} (x + \alpha_i),$$
where $p(x)$ is the common factor of $q_0(x)$ and $q_1(x)$. It has been proved in [14] that $p(x)$ does not contain any factor $(x + \alpha_i)$ for $i \in R$. Therefore, the error locations in the k most reliable code positions can be found by computing the roots of $q_1(x)$. In addition, from (2) and (3),
$$\bar{f}(x) = \frac{q_0(x) \prod_{j \in R}(x + \alpha_j)}{q_1(x)}.$$
Hence, the error magnitudes can be computed by applying L'Hôpital's rule:
$$e_i = \bar{f}(\alpha_i) = \left. \frac{q_0(x) \left( \prod_{j \in R}(x + \alpha_j) \right)^{(d)}}{q_1^{(d)}(x)} \right|_{x = \alpha_i}, \tag{4}$$
where $(\cdot)^{(d)}$ denotes the formal derivative of the polynomial.
From the above discussion, the error locations and magnitudes for the k most reliable code positions can be computed directly from the interpolation output. Therefore, the factorization and the key equation solver in the following hard-decision decoding can be eliminated. In addition, the root computation for $q_1(x)$ can be implemented by the exhaustive-search-based Chien search, which is also required in hard-decision decoding. It can also be observed that the computation in (4) is similar to that in Forney's algorithm for error magnitude computation in hard-decision decoding.
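The similarity to Forney's algorithm becomes visible when the L'Hôpital step is written out. The following derivation is a sketch under the assumptions that $\gamma(x) = q_0(x)/q_1(x)$, that the re-encoded message polynomial equals $\gamma(x)\prod_{j \in R}(x + \alpha_j)$, and that $\alpha_i$ is a simple root of $q_1(x)$. The shorthand $\Lambda_R(x)$ is introduced here only for illustration.

```latex
\begin{align*}
\bar{f}(x) &= \frac{q_0(x)\,\Lambda_R(x)}{q_1(x)},
  \qquad \Lambda_R(x) = \prod_{j \in R} (x + \alpha_j), \\
e_i = \bar{f}(\alpha_i)
  &= \left. \frac{\bigl(q_0(x)\,\Lambda_R(x)\bigr)^{(d)}}{q_1^{(d)}(x)} \right|_{x = \alpha_i}
   = \frac{q_0^{(d)}(\alpha_i)\,\Lambda_R(\alpha_i) + q_0(\alpha_i)\,\Lambda_R^{(d)}(\alpha_i)}
          {q_1^{(d)}(\alpha_i)} \\
  &= \frac{q_0(\alpha_i) \prod_{j \in R,\, j \neq i} (\alpha_i + \alpha_j)}
          {q_1^{(d)}(\alpha_i)},
\end{align*}
```

since $\Lambda_R(\alpha_i) = 0$ for $i \in R$ and $\Lambda_R^{(d)}(\alpha_i) = \prod_{j \in R,\, j \neq i}(\alpha_i + \alpha_j)$. As in Forney's algorithm, the result is a ratio of a numerator evaluation to the evaluated formal derivative of the locator-like polynomial. Over GF($2^q$) the formal derivative keeps only the odd-degree terms.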
V. COMPLEXITY ANALYSIS AND COMPARISON
After the factorization step and key equation solver have been eliminated, the LCC decoding can be carried out according to the block diagram in Fig. 4 [14]. In this section, the hardware complexity of the factorization-free LCC decoder for an RS (255, 239) code constructed over GF($2^8$) with $\eta = 3$ is analyzed. In addition, it is compared with that of a hard-decision decoder based on the BMA. To the best knowledge of the author, no complexity comparison for hard-decision and soft-decision RS decoders has been published.
Hard decisions of the received symbols are usually made in the receiver front end. In addition, the multiplicity assignment of the LCC decoding can be done using comparators while the hard decisions are made. Hence, the hardware required for multiplicity assignment and hard-decision making is excluded from the comparison. The hardware requirement and latency of other blocks in the factorization-free LCC decoder are provided in Table II . Each block has been pipelined if necessary to make the critical path no longer than one multiplier, one adder and one 2-to-1 multiplexor.
The re-encoder in Fig. 4 implements the re-encoding and coordinate transformation. Re-encoding is basically erasure decoding. A detailed architecture of the re-encoder based on the BMA can be found in [30]. The BMA consists of three steps: syndrome computation, key equation solver, and Chien search and Forney's algorithm. In the case of erasure decoding, the key equation solver can be simplified and the Chien search can be skipped. Due to the coordinate transformation, the $\beta$ coordinates of the interpolation points in code position $j$ ($j \in \bar{R}$) need to be divided by $\prod_{i \in R}(\alpha_j + \alpha_i)$. This product can be computed by sharing the hardware for the syndrome computation in erasure decoding. In addition, the inverter is implemented by a ROM of $2^8 \times 8$ bits = 256 bytes. The erasure decoder at the end of the LCC decoder can be implemented by the same erasure decoder architecture as in the re-encoder. The latency of both the re-encoder and the erasure decoder is 528 clock cycles.
The backward-forward interpolation architecture presented in Section III is adopted in our LCC decoder since it can achieve higher efficiency than previous designs. At the beginning of the interpolation for the (255, 239) code, forward interpolation is carried out over the 255 - 239 = 16 points with index in $\bar{R}$ in the first test vector. Since each point is of multiplicity one, 16 forward interpolation iterations are required. After that, one pair of backward and forward interpolation iterations needs to be applied to derive the interpolation output for each of the $2^\eta - 1 = 7$ remaining vectors. The number of clock cycles required in each interpolation depends on the maximum x-degree of the polynomials. From simulations, 525 clock cycles are required for the overall interpolation in the worst case. In addition, at least 28 clock cycles are required for each pair of backward and forward interpolation. The polynomial selection scheme in [6] passes the interpolation output of only one test vector to the following steps of the LCC decoding. This selection is not explicitly shown in Figs. 1 and 4. It is based on the number of roots of $q_1(x)$ in $\bar{R}$. The exhaustive Chien search root computation needs to be finished in 28 clock cycles in order to match the speed of the backward-forward interpolation. Since $|\bar{R}| = 16$ finite field elements need to be searched and $\lfloor 28/16 \rfloor = 1$, the search over each element needs to be completed in one clock cycle. The maximum degree of $q_1(x)$ is 8. Hence the Chien search can be implemented by 8 multipliers and 8 adders. Decisions need to be made based on the number of roots after the root search. Taking this into account, the polynomial selection architecture needs to be pipelined into 7 stages in order to limit the critical path to one multiplier, one adder and one multiplexor. Therefore the computation on each interpolation output for polynomial selection takes 16 + 7 = 23 clock cycles, which is still less than 28.
Since the polynomial selection engine can finish processing one interpolation output before the next one is computed, no extra memory is required to store each interpolation output.
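The root counting that drives the polynomial selection, together with its cycle budget, can be modeled in a few lines of Python. This is a software stand-in for the Chien search hardware, not the architecture itself. It assumes GF($2^8$) generated by the primitive polynomial 0x11D (the source does not state which polynomial is used), and the candidate set, root values, and function names are hypothetical.

```python
# GF(2^8) log/antilog tables; primitive polynomial 0x11D is an assumption.
EXP, LOG = [0] * 512, [0] * 256
e = 1
for i in range(255):
    EXP[i] = EXP[i + 255] = e
    LOG[e] = i
    e <<= 1
    if e & 0x100:
        e ^= 0x11D

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ev(p, a):
    """Horner evaluation; coefficients are stored low order first."""
    r = 0
    for c in reversed(p):
        r = mul(r, a) ^ c
    return r

def poly_from_roots(roots):
    """Build prod (x + r) so the example has known roots."""
    p = [1]
    for r in roots:
        p = [mul(r, c) ^ lo for c, lo in zip(p + [0], [0] + p)]
    return p

def count_roots(q1, candidates):
    """Exhaustive search: one candidate element per clock cycle in hardware."""
    return sum(1 for a in candidates if ev(q1, a) == 0)

candidates = [EXP[i] for i in range(16)]           # stand-in for the 16 searched elements
q1 = poly_from_roots([EXP[2], EXP[5], EXP[40]])    # two roots inside the set, one outside
roots_found = count_roots(q1, candidates)

# Cycle budget from the text: 28 cycles per backward-forward pair,
# 16 elements to search, 7 pipeline stages.
cycles_per_element = 28 // 16
selection_cycles = 16 + 7
print(roots_found, cycles_per_element, selection_cycles <= 28)
```

The budget check mirrors the argument in the text: the search must handle one element per cycle, and the pipelined selection still finishes before the next interpolation output arrives.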
The selected interpolation output is passed to the Chien search & Forney's algorithm block in Fig. 4 to recover the errors with index in $R$. Pipelining can be applied between the functional blocks in the decoder to increase the speed. On the other hand, the computation in each pipelining stage should take about the same time in order to increase the hardware utilization efficiency. Taking this into account, the Chien search for computing the roots of $q_1(x)$ with index in $R$ can be completed in k = 239 clock cycles with 8 multipliers and 8 adders. The denominator of (4) [...] If pipelining is applied to the LCC decoder according to the cutsets shown as the dashed lines in Fig. 4, 528 clock cycles are required to decode a received word. In addition, eight RAMs of 256 bytes are required for pipelining. Employing composite field arithmetic, the critical path of a GF($2^8$) multiplier has 6 XOR gates and 1 AND gate. Accordingly, the critical path of the LCC decoder has 9 gates. On Xilinx Virtex-II FPGA devices, a clock period of 8 ns can be achieved for this decoder. Hence the LCC decoder can easily achieve a throughput of around 500 Mbps on FPGA devices. In ASIC implementations, our decoder can achieve a throughput of at least several gigabits per second.
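The throughput and critical-path figures can be reproduced with simple arithmetic. The sketch below assumes that throughput is counted in codeword bits (255 symbols of 8 bits per received word) and that the adder and the multiplexor each contribute one gate level to the critical path. Neither assumption is stated explicitly in the text, and the variable names are illustrative.

```python
n, k, q = 255, 239, 8            # RS(255, 239) over GF(2^8)
cycles_per_word = 528            # pipelined decoder, cycles per received word
clock_ns = 8.0                   # FPGA clock period from the text

word_time_ns = cycles_per_word * clock_ns
codeword_mbps = n * q / word_time_ns * 1000.0      # counting all codeword bits
information_mbps = k * q / word_time_ns * 1000.0   # counting message bits only

# Critical path: multiplier (6 XOR + 1 AND), then one adder and one multiplexor,
# each assumed to add a single gate level.
critical_path_gates = (6 + 1) + 1 + 1

print(round(codeword_mbps), round(information_mbps), critical_path_gates)
```

Counting codeword bits gives a figure close to the quoted "around 500 Mbps"; counting only message bits gives a somewhat lower number.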
Next, the complexity of a hard-decision decoder based on the BMA is compared to that of the LCC decoder. The architectures of the BMA are scalable. For the purpose of comparison, they are scaled to achieve about the same throughput as the LCC decoder. The architecture for the syndrome computation can be found in [31]. The syndrome computation evaluates the polynomial associated with the received word over the $n-k$ roots of the generator polynomial of the RS code. One syndrome can be computed using one multiplier-adder loop in 255 clock cycles for the RS (255, 239) code. To finish the syndrome computation in about 528 clock cycles, $(n-k)/2 = 8$ copies of the multiplier-adder loop are required.
An ultra-folded key equation solver architecture is presented in [32]. With two multipliers and one adder, this architecture can compute both the error locator and magnitude polynomials in 400 clock cycles. The Chien search in the BMA for the RS (255, 239) code needs to be carried out over 255 finite field elements for a degree-eight error locator polynomial. It can be finished by an architecture with four copies of multiplier and adder in 510 clock cycles. In the BMA, the roots of the error locator polynomial are actually the inverses of the error locations. Hence, a ROM of 256 bytes is required to derive the actual error locations. To reduce the number of pipelining stages, Forney's algorithm can be implemented in parallel with the Chien search. Accordingly, the extra latency required by Forney's algorithm is listed as '0' in Table III. In this case, four copies of multiplier and adder are required to calculate the evaluation values of the polynomial in the numerator. In addition, another multiplier and an inverter are needed to compute the error magnitudes from the evaluation values. The hardware requirement of the hard-decision decoder is summarized in Table III. The critical path of the hard-decision decoder also consists of one multiplier, one adder and one multiplexor. Similarly, pipelining cutsets can be added after the syndrome computation and key equation solver to achieve higher speed. In this case, the decoding of each received word takes 510 clock cycles. In addition, three RAMs of 256 bytes are required to store the hard decisions until the errors are computed.
Using composite field arithmetic, each GF($2^8$) multiplier consists of 64 XOR gates and 48 AND gates. Each AND or OR gate requires 3/4 of the area of an XOR gate, each multiplexor or memory cell has the same area as an XOR gate, and each register occupies about three times the area of an XOR gate. Taking this into account, the area requirement of the LCC decoder is around 2.7 times that of the hard-decision decoder. In addition, the critical path is the same in both decoders, and the number of clock cycles in each pipelining stage is about the same. Hence, the two decoders can achieve about the same throughput. However, since the LCC decoder has one more pipelining stage, the first decoded message appears at the output 528 clock cycles later.
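The area weights above can be turned into a small XOR-equivalent area model. The sketch below applies them only to quantities stated in the text (the multiplier gate counts and the pipelining RAMs); it does not reproduce the 2.7 ratio, which depends on the full contents of Tables II and III. The function and variable names are hypothetical.

```python
# Area weights relative to one XOR gate, as given in the text.
WEIGHTS = {"xor": 1.0, "and_or": 0.75, "mux": 1.0, "mem_cell": 1.0, "register": 3.0}

def xor_equivalent_area(**counts):
    """Sum component counts weighted by their area relative to an XOR gate."""
    return sum(WEIGHTS[name] * count for name, count in counts.items())

# One GF(2^8) composite-field multiplier: 64 XOR gates and 48 AND gates.
multiplier_area = xor_equivalent_area(xor=64, and_or=48)

# Pipelining RAMs: 256 bytes each, one memory cell per bit.
ram_cells = 256 * 8
lcc_ram_area = xor_equivalent_area(mem_cell=8 * ram_cells)   # eight RAMs in the LCC decoder
hdd_ram_area = xor_equivalent_area(mem_cell=3 * ram_cells)   # three RAMs in the BMA decoder

print(multiplier_area, lcc_ram_area, hdd_ram_area)
```

Under this model a multiplier costs 100 XOR-equivalents, so the pipelining memories are a substantial part of the area in both decoders.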
VI. CONCLUSION
This paper presented an efficient high-speed VLSI architecture for soft-decision LCC decoding. Employing the backward interpolation and eliminating the factorization step are major factors contributing to the high speed and efficiency of the decoder. The proposed decoder can achieve a throughput of several gigabits per second in ASIC implementations. Compared to a hard-decision decoder with the same throughput, the soft-decision LCC decoder requires less than three times the area. As a result, it is feasible to employ soft-decision LCC decoding in practical applications.
