Abstract-Reed-Solomon (RS) codes are one of the most extensively used error control codes in digital communication and storage systems. Recently, significant advancements have been made on algebraic soft-decision decoding (ASD) of RS codes. These algorithms can achieve substantial coding gain with polynomial complexity. One major step of ASD is the interpolation. Various techniques have been proposed to reduce the complexity of this step. Further speedup of this step is limited by the inherent serial nature of the interpolation algorithm. In this paper, taking the bit-level generalized minimum distance (BGMD) ASD as an example, we propose a novel technique to combine the computations from multiple interpolation iterations. Compared to the single interpolation iteration architecture for a (255, 239) RS code, the combined architecture can achieve 2.7 times throughput with only 2% area overhead in high signal-to-noise ratio scenarios.
I. INTRODUCTION
Reed-Solomon (RS) codes are among the most widely used error-correcting codes in digital communication and data storage systems. Recently, significant advancements have been made on algebraic soft-decision decoding (ASD) [1] - [3] of RS codes. By incorporating the probability information from the channel into the algebraic interpolation process developed by Guruswami and Sudan [4] , significant coding gain can be achieved by ASD, while their complexity is polynomial with respect to the codeword length. ASD algorithms consist of two major steps: the interpolation and the factorization. In this paper, we focus on the interpolation step, which is the speed bottleneck of ASD algorithms.
The most popular interpolation algorithm is the Nielson's algorithm [5] , [6] . This is an iterative algorithm and each iteration mainly consists of discrepancy coefficient computation and candidate polynomial updating. Various techniques have been proposed to reduce the complexity of the interpolation. The total number of iterations involved in the interpolation can be reduced by employing the re-encoding and coordinate transformation techniques proposed in [7] , [8] . Optimizing the computations involved in each iteration is another approach to reduce the interpolation complexity. The latency of the discrepancy coefficient computation can be reduced by the point-serial scheme proposed in [9] , [10] . Based on this scheme, a systolic array architecture has been proposed for the interpolation step [9] . In addition, deep pipelining of the interpolation architecture is enabled by using a hybrid representation for finite field elements [11] . In [12] , it is found that a significant proportion of the discrepancy coefficients are zero and the updating of corresponding candidate polynomials can be skipped. Accordingly, the number of hardware units implementing candidate polynomial updating can be reduced substantially. Although these efforts led to faster and smaller area implementation of the interpolation, further speedup is limited by the inherent serial nature of the Nielson's algorithm.
In this paper, we present a novel technique to combine the computations from multiple iterations of the interpolation. To illustrate our idea, we use the bit-level generalized minimum distance (BGMD) ASD with maximum multiplicity two as an example. Multiple interpolation iterations need to be carried out for an interpolation point with multiplicity larger than one. For the iterations associated with the same interpolation point, the discrepancy coefficients can be computed in a 'look-ahead' manner. Based on the computed discrepancy coefficients, the candidate polynomial updating in those iterations can be combined efficiently. In the BGMD algorithm, there may also be pairs of interpolation points with multiplicity one that share the same X coordinate. Only one interpolation iteration is required for each of these points. However, the discrepancy coefficients for each pair can be computed simultaneously with small overhead due to the same X coordinate. Accordingly the corresponding polynomial updating can be also combined. The savings can be brought by the proposed scheme is dependent on the multiplicities of the interpolation points, which are decided by channel conditions. In high signal-to-noise ratio (SNR) scenarios, our proposed combined interpolation architecture for the BGMD decoding of a (255, 239) RS code can achieve 2.7 times throughput compared to single interpolation iteration architectures, while the area requirement has only been increased by 2%. In terms of speed/area ratio, our architecture is 165% more efficient. At low SNR, our architecture is 70% more efficient.
The structure of this paper is as follows. Section 2 introduces BGMD ASD algorithm and the Nielson's interpolation algorithm. The proposed combined interpolation scheme is presented in Section 3. Then in Section 4, efficient architectures are proposed for the combined interpolation. Section 5 gives the hardware requirement and latency analyses. Conclusions are provided in Section 6.
II. THE BGMD ALGORITHM AND THE NIELSON'S INTERPOLATION
In this paper, we consider RS codes constructed over finite field GF(2 q ). For a primitive RS codes, n = 2 q − 1. For k
The encoding of RS codes can be carried out by evaluating the message polynomial at n distinct nonzero elements of GF(2 q ). Assume the fixed-ordered set {x 0 , x 1 , ··· , x n−1 } is chosen as the n distinct evaluation elements, the codeword corresponding to
ASD algorithm decoding consists of three steps: the multiplicity assignment, the interpolation, and the factorization. As the first step, the multiplicity assignment receives the reliability information from the channel and then decides on the multiplicities of the interpolation points. To increase the probability that the correct message polynomial can be recovered, usually assign larger multiplicities to those more reliable points. Popular multiplicity assignment schemes include the Koetter-Vardy (KV) [1] , low-complexity Chase (LCC) [2] and the BGMD scheme [3] . In this paper, we will use the BGMD scheme with maximum multiplicity two as an example to illustrate the proposed technique. In the BGMD algorithm, the multiplicities are assigned by using bit-level reliabilities. Each received symbol, θ j , consists of q noise-corrupted bits. Define the loglikelihood ratio (LLR) of a noise-corrupted bit, δ , by LLR δ = log(Pr(0|δ )/Pr(1|δ )). δ is considered to be erased if |LLR δ | is smaller than a threshold. Assume the maximum multiplicity of the interpolation points is m max , the BGMD algorithm assigns multiplicities based on the number of bits erased in each received symbol according to the following scheme.
Algorithm A: BGMD Multiplicity Assignment 1) if no bit is erased in θ j , assign m max to (x j , y j ), where y j is the hard-decision of θ j ;
2) if there is only one bit erased in θ j , assign m max /2 to both (x j , y j1 ) and (x j , y j2 ), where y j1 and y j2 are the hard-decision of θ j and the field element that differs from the hard-decision in only the erased bit, respectively;
3) if there are more than one bit erased in θ j , do not assign any multiplicity to the interpolation points with x j .
Although ASD algorithms differ in the multiplicity assignment step, they share the same interpolation and factorization steps. Some definitions necessary for understanding the interpolation-based decoding algorithms are given below. Definition 2 For non-negative integers w x and w y , the (w x , w y )-weighted degree of a bivariate polynomial
The purpose of the interpolation step is to find a bivariate polynomial, Q(X,Y ), with minimum (1, k − 1) weighted degree that passes each non-trivial interpolation point with at least its associated multiplicity. Then the factorization finds all factors of Q(X ,Y ) in the form of Y − f (X ) with the degree of f (X ) at most k. These derived polynomials contain all the message polynomials in the list as a subset. In this paper, we focus on the interpolation step. Popular interpolation algorithms include the Nielson's algorithm and the Lee-O'Sullivan (LO) algorithm [13] . However, the LO algorithm can not take in more points once the interpolation started. In this paper, we use the Nielson's algorithm in our design. The pseudo codes of this algorithm are listed in Algorithm B.
Algorithm B: Nielson's Interpolation
initialization:
In Algorithm B, the discrepancy coefficient, d
where q
The basic idea of Algorithm B is to first initialize a set of t + 1 bivariate candidate polynomials
Then these polynomials are updated to satisfy one extra interpolation constraint at a time at the cost of minimum increase in (1, k − 1)-weighted degree. After the constraints for all interpolation points have been satisfied, the candidate polynomial with the least weighted degree is selected as the output. It can be derived that an interpolation point of multiplicity m adds m(m + 1)/2 constraints. Since one extra constraint is satisfied in each iteration, m(m + 1)/2 interpolation iterations need to be carried out for a point of multiplicity m. The re-encoding and coordinate transformation [7] , [8] can convert the Y coordinate of the k most reliable interpolation points to zero. As a result, the interpolation over these points can be presolved by simple univariate interpolation. In addition, the point-serial scheme [9] , [10] and a hybrid finite field element representation [11] can be employed to increase the speed of the discrepancy coefficient computation. Further speedup of the interpolation is limited by the serial nature of Algorithm B.
III. THE COMBINED INTERPOLATION
Due to the data dependency, the computations from multiple interpolation iterations can not be carried out simultaneously. In this section, taking the BGMD algorithm as an example, we propose a novel and efficient scheme that combines the candidate polynomial updating of multiple iterations.
In the BGMD decoding with maximum multiplicity two, each coordinate position can have one interpolation point with multiplicity two, two interpolation points with multiplicity one or no interpolation point. In this case, it can be derived that there are three candidate polynomials involved in the interpolation for high rate codes (t = 2 in Algorithm B). Now consider the interpolation over a point, (x 0 , y 0 ), of multiplicity two, which requires three iterations for the constraints (α, β ) = (0, 0), (0, 1) and (1, 0) in Algorithm B. Since the interpolation point is the same in these three iterations, (x 0 , y 0 ) is dropped from the d 
In the remaining of this paper, we use d
will not be picked as the minimum polynomial and does not need to be updated in the iteration of (α, β ) = (0, 1). On the other hand, d
(1)
can be the minimum polynomial in the iteration of (α, β ) = (0, 1), depending on their weighted degrees and whether d
0,1 is zero. Assume Q (1) (X,Y ) is picked as the minimum polynomial, the candidate polynomials are updated again according to the I3 and I4 steps. Similarly, Q (1) (X ,Y ) has zero discrepancy coefficient in the last iteration of (α, β ) = (1, 0), and will not be the minimum polynomial and does not need to be updated. In the last iteration, from (2), d
1,0 is nonzero. Therefore, the minimum polynomial can be either
Depending on whether Q (0) (X ,Y ) has been picked as the minimum polynomial twice or once in the three iterations, the updated polynomials from the last iteration can be written in terms of the polynomials at the beginning of the first iteration in one of the two formats below:
From (1), it can be observed that d
and d
can be computed as byproducts of the d
0,0(0,0) computation. Hence, compared to the discrepancy coefficient computation in the iteration of (α, β ) = (0, 0), the computation of Δ j only requires small overhead to derive the sum of products.
As it can be observed from (3) and (4), Q (1) (X ,Y ), which is picked as the minimum polynomial in the iteration of (α, β ) = (0, 1), is the same for both cases. In addition, the Q (2) (X ,Y ) for the first case is the same as the Q (0) (X ,Y ) for the second case. When different candidate polynomials are picked as the minimum polynomials in the three iterations, the output polynomials at the end of the third iteration are in the same format as either (3) or (4), except that the candidate polynomials are switched. From our extensive simulations, we did not find any case that has all three discrepancy coefficients as zero in a iteration.
In the BGMD algorithm, there may also be pairs of interpolation points with multiplicity one that share the same X coordinate. For each of these points, only one interpolation iteration needs to be carried out. Although the discrepancy coefficient for one point can not be computed as the byproduct of the discrepancy coefficient computation of another, the discrepancy coefficients for the two points with the same X coordinate can be computed simultaneously with small overhead due to the shared coordinate and low Ydegree. Accordingly, the corresponding candidate polynomial updating can be also combined. In addition, the polynomial picked as the minimum polynomial during the interpolation over the point (x j , y j1 ) is multiplied by (X + x j ). Hence this polynomial has zero discrepancy coefficient during the interpolation over the point (x 1 , y j2 ), and will not be picked as the minimum polynomial again. Assume Q (0) (X ,Y ) and Q (1) (X,Y ) , respectively, are the minimum polynomials in the interpolation iterations for two points of multiplicity one that share the same X coordinate. The candidate polynomial updating in these two iterations can be combined as
where Δ i (i = 0, 1, 2) have the same format as those in (5), except that d
should be replaced by d
for the second point with the same X coordinate. In the case that different polynomials are picked as the minimum polynomials, the combined polynomials are in the same format as those in (6).
IV. VLSI ARCHITECTURE FOR THE COMBINED INTERPOLATION
In this section, efficient architectures for the combined interpolation are presented. Since all the coefficients in the combined candidate polynomial updating in (3), (4) and (6) are computed based on the polynomial coefficients at the beginning of the iteration of (α, β ) = (0, 0), the subscript (0, 0) is dropped from the notation d
A. Discrepancy Coefficient Computation Architecture
To enable the combining of the candidate polynomial updating, d
(l) α,β for (α, β ) = (0, 0), (0, 1) and (1, 0) need to be computed first based on the starting candidate polynomials in the first iteration. Then Δ i are computed by using (5). The discrepancy coefficient computation equation can be rewritten as
where c
. Substituting α by 0 and 1, respectively, it can be derived that 1 x 0 does not need multiplier and can be done by adding just one adder loop as shown by the bottom branch in Fig. 1(a) . Nine copies of the architecture in Fig. 1(a) Once c
α,β can be derived according to (7) . Since the maximum Y -degree of the polynomials is two, d 1,0 can be computed using the architecture in Fig. 1(b) by routing proper inputs through the multiplexors. In total, three copies of this architecture are required, one for each candidate polynomial. After the discrepancy coefficients are computed, each of Δ i for i = 0, 1, 2 is computed by one copy of the architecture. Then x 0 Δ 3 is computed by using a single architecture. Since all these computations are carried out in the architecture in Fig.  1(b) in a time-multiplexed way, it takes ξ dccb = 4 clock cycles to derive the required coefficients in (3) and (4). The same architectures can be used to compute the coefficients in (6).
B. Polynomial Updating Architecture
After Δ j have been computed, the polynomial updating can be carried out by using the PU architecture shown in Fig.  2 . Since there is no data dependency among the coefficients with different Y degree, they are updated in parallel in our design. Fig. 2 shows the architecture for updating the coefficients with one of the three Y -degrees. In this architecture, the multiplications by (X + x 0 ) are implemented by feeding the polynomial coefficients serially with the least significant one first to the A blocks. It can be observed that the four different polynomials in (3) and (4) have the following common terms:
. These terms are shared in the PU architecture to reduce the complexity of the combined polynomial updating. (3) and (4) lower input to each of the 2-to-1 multiplexor on the right of Fig. 2 , the three polynomials in (3) or (4) are available at the output. In addition, the PU architecture also takes care of the combined polynomial updating for two interpolation points of multiplicity one according to (6) . In this case, the upper input of each multiplexor needs to be passed to the output. As it was mentioned in Section 3, it is possible that other candidate polynomials instead of Q (0) (X ,Y ) and Q (1) (X ,Y ) are picked as the minimum polynomials in the first and second iterations. In this case, the candidate polynomials and their discrepancy coefficients need to be switched. These switching are handled by the 3-to-1 multiplexors in Fig. 2 . It can be noted that an extra multiplier is added to multiply x 0 with d (0) 0,0 . This is because that Δ 3 is scaled by x 0 when it is computed, and hence d (0) 0,0 also needs to be scaled by the same factor to compute Q (2) 
0,0 here requires one more multiplier, nine multipliers have been saved in the DCC architecture by computing x 0 Δ 3 instead of Δ 3 .
In previous designs [10] , [11] , the coefficients of updated polynomials are fed back to their own memory blocks and multiplexors are required for coefficient routing. In our architecture, instead, the three updated polynomials at the output are written back to fixed memory blocks. Although a few extra gates are required to keep track of the weighted degree of the polynomials, the multiplexors for routing are eliminated and hence the critical path as well as total gate number are reduced. To increase the speed, registers are added to break the data path in Fig. 2 into ξ pu = 2 pipelining stages.
C. Computation Scheduling
All blocks in the DCC architecture in Fig. 1 and the PU architecture in Fig. 2 take polynomial coefficients with the least significant one first. Once the coefficients have been updated, they can be passed to the DCC architecture to compute the discrepancy coefficients for the next interpolation iteration. A control unit is employed to generate the control signals for the DCC and PU architectures. It is also used to record and update the weighted degrees of the polynomials. During the interpolation, the degrees of polynomials are either increased by one or two according to (3) and (4) . The degree change can be decided right after the discrepancy coefficients are computed. Hence the degree updating can be carried out in parallel with polynomial updating.
It can be observed that the block B in Fig. 2 has the same architecture as that in Fig. 1 (b) , if the multiplexors are ignored. All coefficients of q α,β and Δ i are computed. Hence, the architecture in Fig.  1 (b) and the block B in Fig. 2 will not be activated simultaneously. Therefore, by adding one extra input to each of the multiplexors on top of the multipliers in Fig. 1 , a single architecture can be shared for both purposes in a timemultiplexed way. Fig. 1 (b) is not counted in the DCC architecture. They are included in the PU architecture instead. As it can be observed from Table I , the critical path of our combined interpolation architecture consists of 12 gates. From simulations, the maximum X -degree of the candidate polynomial coefficients is 33. Hence a memory of 33 × 3 × 3 = 297 bytes is required to store the polynomial coefficients involved in the interpolation. In our design, the weighted degrees of the polynomials are stored in registers. Taking these registers into account, the total number of registers required in our design is 345.
V. HARDWARE REQUIREMENT AND LATENCY ANALYSES
In the original Nielson's interpolation, one iteration is carried out for each interpolation constraint. Hence only one discrepancy coefficient needs to be computed for each polynomial. This computation can still be carried out by the architecture shown in Fig. 1 by removing unnecessary multiplexors. Compared to prior architectures for discrepancy coefficient computation, our architecture requires less gate in the case that the maximum multiplicity is two. The candidate polynomial updating architecture proposed in [10] has less gate count than other published designs. The critical path in this architecture consists of one multiplier, three Mux and one inverter. A GF(2 8 ) inverter can be implemented by a 256 × 8 look-up table (LUT). In addition, its delay usually equals to that of 2-3 serially concatenated XOR gates. Replacing the delay of the inverter with that of 2 XOR gates, the critical path of single iteration interpolation architecture includes 12 gates. In total, the implementation of this architecture needs 2055 XOR gates, 1375 AND gates, 29 OR gates, 184 Muxes, and 273 registers. Besides the memory TABLE I   GATE COUNTS AND CRITICAL PATHS   area  critical path  Multiplier  64XOR+48AND  6XOR+1AND  Adder  8XOR  1XOR  DCC  784XOR+552AND  7XOR+ 2AND  PU  2152XOR+1488AND+552Mux  8XOR+1AND+3Mux  Control  61XOR+102AND+106OR+82Mux  1XOR+4AND+4OR+3Mux  Total  2997XOR+2142AND+106OR+634Mux  N/A for the LUT, the same amount of memory is needed to store the polynomial coefficients as in our design. Accordingly, the memory requirement is 256 + 297 = 553 bytes. The DCC and PU architectures take in coefficients serially. Hence, the number of clock cycles required in each interpolation iteration is decided by the maximum X-degree of the candidate polynomials. This degree increases uniformly with interpolation iteration. Therefore, the latency of the overall interpolation can be derived by multiplying the average number of clock cycles required by each iteration with the number of interpolation iterations. In addition, in each iteration, extra clock cycles used for pipelining are needed to take into account and the total pipelining latency is ξ dcca + ξ dccb + ξ pu = 1 + 4 + 2 = 7 in our design. From simulations, it can be derived that the average maximum X-degree of the polynomials, d x , is 13 when the signal-to-noise ratio (SNR) is high. Accordingly, the average number of clock cycles required for each combined interpolation iteration is d x + 7 = 20. At low SNR, d x becomes 8. Hence, the average number of clock cycles required for each combined interpolation iteration becomes 15. After applying re-encoding and coordinate transformation, the interpolation only needs to be carried out over at most n − k = 16 interpolation points of multiplicity two, or 16 pairs of points with multiplicity one, or any combination of them. As a result, at most 16 combined interpolation iterations are required in our proposed scheme. In the original Nielson's interpolation, the average number of clock cycles required for each iteration is d x + 5. However, the number of interpolation iterations depends the number of constraints need to be satisfied. At high SNR, most of the interpolation points have multiplicity two, each of which adds 3 constraints. Hence, the number of interpolation iterations is at most 3 × 16 = 48. On the other hand, at lower SNR, most of the interpolation points have multiplicity one, and at least 2 × 16 = 32 iterations are required in this case.
Each AND gate or OR gate occupies 3/4 of the area of an XOR, each memory cell and Mux has the same area as an XOR, and each register requires about 3 times of the area of an XOR. Taking this into account, for a (255, 239) RS code, the area requirement of the proposed interpolation architecture is equivalent to that of 2997 + (2142 + 106) × 0.75 + 634+297×8+345×3 = 8728 XOR gates. The area required by the single iteration interpolation architecture equals that of 2055 + (1375 + 29) × 0.75 + 184 + 553 × 8 + 273 × 3 = 8535 XOR gates. Hence, our architecture only requires 2% more area. The critical path of our architecture is the same as that in the previous one. In high SNR scenarios, our architecture can achieve a throughput of (18 × 48)/(20 × 16) = 2.7 times. Hence, our architecture is 165% more efficient in terms of speed/area ratio. When the SNR is low, our architecture is (13 × 32)/(15 × 16) = 1.73 times faster, and thus is 70% more efficient.
The application of the proposed combined polynomial updating scheme is not limited to the architectures discussed in this paper. Considering the interpolation architectures in [11] , the proposed scheme can also be employed to achieve further speedup.
VI. CONCLUSION
A novel scheme is proposed in this paper to combine interpolation iterations. In our scheme, the discrepancy coefficients from multiple iterations are computed using a 'look-ahead' method. Then the polynomial updating of these iterations are combined. Applying the proposed scheme, our combined interpolation architecture can achieve significant speedup at the cost of negligible area overhead. In the case that more candidate polynomials are involved, there are more possibilities with regard to which polynomial is chosen as the minimum polynomial in each iteration. As a result, combining interpolation iterations would lead to larger number of different formats for the updated polynomials. We will study if the scheme proposed in this paper can be extended to the cases where more polynomials are involved.
