Reed-Solomon (RS) codes are among the most widely used error-correcting codes in modern communication and computer systems. Among the decoding algorithms for RS codes, the recently proposed Koetter-Vardy (KV) soft-decision decoding algorithm can achieve substantial coding gain while having polynomial complexity. One of the major steps of KV decoding is the factorization. The root computation involved in each iteration level of the factorization is traditionally implemented by exhaustive search. A fast factorization architecture has been proposed to circumvent the exhaustive root search from the second iteration level onward by using a root-order prediction scheme. However, the root computation in the first iteration level is still carried out by exhaustive search, which accounts for a significant part of the overall factorization latency. In this paper, a novel iterative prediction scheme is proposed to compute the roots in the first iteration level. The proposed scheme can substantially reduce the average latency of the factorization while incurring only negligible area overhead. Applying this scheme to a (255, 239) RS code, a speedup of 36% can be achieved.
INTRODUCTION
Reed-Solomon (RS) codes have very broad applications. They can be found in magnetic and optical recording, spread-spectrum wireless systems, as well as satellite and deep-space communications. Since RS codes were introduced in the 1960s, tremendous efforts have been devoted to developing efficient and high-gain decoding algorithms. The well-known Berlekamp-Massey algorithm (BMA) [1] can correct fewer than $d_{min}/2 = (n - k + 1)/2$ errors for an $(n, k)$ RS code. Comparatively, list-decoding algorithms attempt to correct more errors by finding all the codewords within a distance larger than $d_{min}/2$ from the received word. A breakthrough in list decoding was achieved by the Sudan [2] and Guruswami-Sudan (GS) [3] algorithms. These algorithms are based on an algebraic interpolation technique. By forcing all the interpolation points in the Sudan algorithm to have higher multiplicities, the GS algorithm can correct up to $n - \sqrt{nk}$ errors. Higher coding gain can also be achieved by soft-decision decoding algorithms, which make use of the probability information from the channel. Various soft-decision algorithms have been proposed. However, they either achieve only relatively low coding gain or have exponential complexity. Recently, Koetter and Vardy extended the GS algorithm by incorporating the probability information into the algebraic interpolation process [4]. By forcing the interpolation points with higher reliability to have higher multiplicities, this algorithm can achieve substantial coding gain while its complexity remains polynomial with respect to the codeword length.
The major steps of the Koetter-Vardy (KV) algorithm are the interpolation and the factorization. Re-encoding and coordinate transformation have been introduced to reduce the interpolation complexity [5, 6, 7, 8]. Applying these techniques, the number of iterations that need to be carried out in the factorization can also be reduced. Each iteration of the factorization mainly consists of root computation over finite fields and polynomial updating. The root computation is traditionally implemented by exhaustive search, which incurs long latency for long codes. A fast factorization architecture was proposed to reduce the average latency associated with the root computation [9]. In this architecture, the exhaustive root search from the second iteration level onward can be circumvented with more than 99% probability by employing a root-order prediction scheme. In addition, applying the root-order prediction scheme, the roots of two adjacent iteration levels can be computed in a short time. Based on this feature, a partial parallel factorization architecture was developed to combine the polynomial updating in adjacent iteration levels [10]. However, the speedup of this architecture comes at the expense of a significantly increased area requirement. In both of these architectures, the root computation in the first iteration level is still carried out by exhaustive search, which accounts for a significant part of the overall factorization latency.
In this paper, a novel iterative prediction scheme is proposed to compute the roots in the first iteration level. This scheme carries out up to three trial-and-error direct root computations before exhaustive search is employed. As a result, the average latency associated with the first iteration root computation can be significantly reduced. In addition, all the hardware units required by this scheme can be shared with the units that already exist in the architectures of [9] and [10]. Therefore, the extra hardware required by this scheme is negligible. Applying the proposed algorithm to a (255, 239) RS code, a speedup of 36% can be achieved, while the extra area requirement is around 1%.
Fig. 1. The KV decoding algorithm
The structure of this paper is as follows. Section 2 describes the factorization step in the KV algorithm and the root-order prediction scheme. Section 3 presents the iterative prediction scheme for the first iteration level root computation. Then the factorization architecture, which has incorporated the proposed scheme, is presented in Section 4. Section 5 summarizes this paper.
FACTORIZATION IN THE KV ALGORITHM
In this paper, we only consider RS codes constructed over finite fields of characteristic two. Fig. 1 illustrates the block diagram of the KV algorithm. The multiplicity computation step decides on the interpolation points and their relative multiplicities according to the probability information from the channel. In practice, this step can be implemented by multiplying a probability information matrix by a nonnegative real number and then applying the floor function. Comparatively, the interpolation and factorization steps are much more hardware demanding. For an $(n, k)$ code, the interpolation step finds a bivariate polynomial $Q(x, y)$ with minimum weighted degree that passes through each non-trivial interpolation point with at least its associated multiplicity. Then the factorization step finds all factors of $Q(x, y)$ of the form $y - f(x)$ with $\deg(f(x)) < k$. Here $\deg(\cdot)$ denotes the degree of a polynomial. Among various factorization algorithms, the one proposed in [11] is suitable for efficient hardware implementation. This algorithm can be described by the pseudocode listed in Algorithm A.
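The multiplicity computation described above (scale, then floor) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `multiplicity_matrix`, the parameter name `scale`, and the example reliability values are all assumptions made here.

```python
import math

def multiplicity_matrix(reliability, scale):
    """Scale each reliability value and round down to an integer multiplicity."""
    if scale < 0:
        raise ValueError("scale must be nonnegative")
    return [[math.floor(scale * p) for p in row] for row in reliability]

# Hypothetical reliabilities: 4 candidate symbols (rows) x 3 code positions (columns).
reliability = [
    [0.90, 0.10, 0.40],
    [0.05, 0.70, 0.35],
    [0.03, 0.15, 0.20],
    [0.02, 0.05, 0.05],
]
for row in multiplicity_matrix(reliability, 4.5):
    print(row)
```

Entries that round down to zero correspond to points that take no part in the interpolation; a larger scaling factor yields larger multiplicities for the reliable points.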
In Algorithm A, the roots in each output vector form the coefficients of a polynomial of degree $k - 1$. The polynomials corresponding to all output vectors contain the $f(x)$ factors as a subset. Applying re-encoding and coordinate transformation, the number of factorization iteration levels can be reduced from $k$ to twice the number of errors intended to be corrected in the most reliable code positions. After the total iteration number has been reduced, further speedup of the factorization can be achieved by optimizing the computations involved in each iteration level. As can be observed from Algorithm A, each iteration mainly consists of the root computation in the F2 step and the polynomial updating in the F3 step. The F1 and F4 steps are trivial. They can be implemented by using a register to keep track of the address displacements. Traditionally, the root computation over finite fields is implemented by exhaustive search. For long codes, the F2 step therefore has very long latency.
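To make the cost of the traditional F2 step concrete, the sketch below performs an exhaustive root search over a small field. It is only an illustration: the field GF(2^4) with primitive polynomial x^4 + x + 1, and the names `gf_mul`, `poly_eval` and `exhaustive_roots`, are assumptions chosen here for brevity (the codes discussed in the paper use larger fields).

```python
# GF(2^4) built from the primitive polynomial x^4 + x + 1 (illustrative choice).
EXP, LOG = [0] * 30, [0] * 16
v = 1
for i in range(15):
    EXP[i] = EXP[i + 15] = v
    LOG[v] = i
    v <<= 1
    if v & 0b10000:
        v ^= 0b10011

def gf_mul(a, b):
    """Multiply two field elements using log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def poly_eval(coeffs, x):
    """Evaluate a polynomial (low-order coefficient first) by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc

def exhaustive_roots(coeffs):
    """Try every field element in turn, as an exhaustive search does."""
    return [x for x in range(16) if poly_eval(coeffs, x) == 0]

# (y + 3)(y + 7) = y^2 + 4y + 9 over this field.
print(exhaustive_roots([9, 4, 1]))
```

The search always visits every field element, so its latency grows with the field size regardless of how many roots exist.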
The latency of the root computation from the second iteration level onward can be reduced by the root-order prediction scheme proposed in [9]. From Algorithm A, it can only be derived that the degree of $Q(0, y)$ is at most the order, $r$, of the corresponding root in the previous iteration level. However, it is found from simulations that in most cases the degree of $Q(0, y)$ equals $r$ and $Q(0, y)$ has a single root of order $r$. Therefore, if a root with order $r$ is found in iteration level $i$, then it is predicted that the corresponding $Q(0, y)$ in iteration level $i + 1$ can be expressed as
$$Q(0, y) = q_{0,r}\,(y + \gamma)^r, \qquad (1)$$
where $q_{0,j}$ denotes the coefficient of $x^0 y^j$ in $Q(x, y)$ and $\gamma$ is the predicted root. From (1), it can be derived that
$$\gamma = \left(\frac{q_{0,r-w}}{q_{0,r}}\right)^{1/w}, \qquad (2)$$
where $w$ is the minimum positive integer for which the binomial coefficient $\binom{r}{w}$ is odd. The prediction failure rate of this scheme is very low for practical applications of RS codes. However, ignoring these cases may bring significant performance degradation to the KV algorithm. From the definition of root order,
Fig. 2. Simulation results for maximum root order
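The root-order prediction can be illustrated numerically. The sketch below assumes the predicted form in which $Q(0, y)$ is a constant times $(y + \gamma)^r$, takes $w$ as the lowest set bit of $r$ (the smallest positive $w$ with an odd binomial coefficient), and extracts the $w$-th root by repeated squaring. The field GF(2^4) and all function names are illustrative assumptions.

```python
M = 4  # work in GF(2^M); x^4 + x + 1 is an illustrative primitive polynomial
EXP, LOG = [0] * 30, [0] * 16
v = 1
for i in range(15):
    EXP[i] = EXP[i + 15] = v
    LOG[v] = i
    v <<= 1
    if v & 0b10000:
        v ^= 0b10011

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(16)")
    return 0 if a == 0 else EXP[LOG[a] - LOG[b] + 15]

def poly_mul(a, b):
    """Product of two polynomials, low-order coefficient first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out

def predicted_root(q):
    """Assume q(y) = q_r * (y + g)^r with r = deg q (1 <= r < 16); return g."""
    r = len(q) - 1
    w = r & -r                                # lowest set bit of r
    g = gf_div(q[r - w], q[r])                # equals g^w under the prediction
    # w = 2^l, so the w-th root is l square roots, i.e. M - l squarings.
    for _ in range((M - (w.bit_length() - 1)) % M):
        g = gf_mul(g, g)
    return g

# Build 5 * (y + 9)^6 and recover the root from two coefficients only.
q = [5]
for _ in range(6):
    q = poly_mul(q, [9, 1])
print(predicted_root(q))
```

Only one division and a few squarings are needed, which is why a correct prediction is so much cheaper than a search over the whole field. A real implementation would still have to verify the prediction before accepting the root.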
In the fast factorization architecture of [9], the root computation in the first iteration level is still carried out by exhaustive search, which may account for a significant part of the overall factorization latency. For example, the exhaustive root search for a (255, 239) RS code in the first iteration level can consume 31% of the fast factorization latency. In the partial parallel factorization architecture of [10], the factorization latency is reduced by combining the polynomial updating from adjacent iteration levels. However, the latency associated with the exhaustive root search in the first iteration level does not change. Therefore, the first iteration root computation accounts for an even more significant part of the overall latency in this architecture. In the next section, we introduce a novel iterative prediction scheme for the root computation in the first iteration level. This scheme carries out up to three direct root computation trials before exhaustive search is employed. As a result, the average latency for the first iteration root computation is substantially reduced. In addition, this scheme only requires negligible extra area compared to the area required by the entire factorization architecture.
ROOT COMPUTATION IN THE FIRST ITERATION LEVEL
The $Q(0, y)$ in the first iteration level may have multiple roots. At first glance, no information on these roots can be directly derived. However, from simulations, we found that the maximum order of the roots in the first iteration level is very likely to be close to the degree of $Q(0, y)$. Assume $\deg(Q(0, y)) = t$ and the maximum order of the roots in the first iteration level is $r$. Fig. 2 illustrates some simulation results on the probability of $r = t$ and $r \ge t - 2$ for the (15, 11) and (255, 239) codes with different multiplicity settings. These results are obtained under an AWGN channel with BPSK modulation. From Fig. 2, it can be observed that the probabilities of $r = t$ and $r \ge t - 2$ decrease as the multiplicities increase. In addition, it is more likely that the maximum root order is close to $\deg(Q(0, y))$ when the frame error rate (FER) is low.
In the case of $r = t$, the root can be computed according to (2). However, the probability of this case is low when the FER is high, the code is long, or the multiplicities are large. Fortunately, under certain circumstances, we can still compute a root of $Q(0, y)$ directly from its coefficients if the root order is less than $t$. Assume that among the roots of $Q(0, y)$ there is a root $\alpha$ with order $r < t$. Then $Q(0, y)$ can be written as
$$Q(0, y) = (y + \alpha)^r\, h(y), \qquad (4)$$
where $h(y)$ is the remaining factor of degree $t - r$, which does not have $\alpha$ as a root.
Multiplying out the right side of (4), it can be observed that each coefficient of $Q(0, y)$ equals the sum of the terms listed in Table 1. In this table, we assume $r \ge (t - r)$. Otherwise, the terms with negative powers of $\alpha$ should be replaced by zero.
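The expansion behind Table 1 can be written out explicitly. The following is a sketch under the assumption that the factor remaining after the order-$r$ root $\alpha$ is removed is written as $h(y) = \sum_j h_j y^j$; the names $h$ and $h_j$ are introduced here for illustration and may differ from the original notation.

```latex
Q(0,y) = (y+\alpha)^{r} \sum_{j=0}^{t-r} h_j\, y^{j}
\quad\Longrightarrow\quad
q_{0,m} = \sum_{j=0}^{t-r} \binom{r}{m-j}\, \alpha^{\,r-m+j}\, h_j ,
\qquad 0 \le m \le t ,
```

Here each binomial coefficient is reduced modulo 2 and is taken as zero when $m - j < 0$ or $m - j > r$. If two columns $m_1 < m_2$ contain nonzero binomial coefficients in exactly the same rows $j$, every term of $q_{0,m_1}$ is the matching term of $q_{0,m_2}$ multiplied by $\alpha^{m_2 - m_1}$, so that $q_{0,m_1} = \alpha^{m_2 - m_1}\, q_{0,m_2}$.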
Since the binomial coefficients are computed over $GF(2)$, they are either zero or one. Therefore, in Table 1, there may exist two columns in which the nonzero binomial coefficients have the same pattern. In this case, each term in one column equals the term in the same row of the other column multiplied by $\alpha^s$, where $s$ is the difference between the indices of the two columns. Accordingly, $\alpha^s$ can be computed by a division between the coefficients of $Q(0, y)$ corresponding to these two columns. For example, Table 2 lists the binomial coefficients for one particular pair of $r$ and $t$. It can be observed that the '1's in the column for $q_{0,7}$ have the same pattern as those in the column for $q_{0,3}$. In addition, the difference of the power of $\alpha$ between the corresponding terms in these two columns is $7 - 3 = 4$. Therefore, $\alpha^4$ can be computed as $q_{0,3}/q_{0,7}$. Furthermore, it can be observed from Table 2 that $\alpha^4$ can also be computed as $q_{0,2}/q_{0,6}$.
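The column-matching argument can be checked mechanically by listing the parities of the binomial coefficients. The sketch below is an illustration only: the function names are hypothetical, and the pair $r = 4$, $t = 7$ is chosen here as an example rather than taken from Table 2.

```python
from math import comb

def column_pattern(r, t, m):
    """Parities of the binomial coefficients in the column of q_{0,m}.

    Row j corresponds to the coefficient of y^j in the remaining factor,
    which has degree t - r.
    """
    return tuple(comb(r, m - j) & 1 if 0 <= m - j <= r else 0
                 for j in range(t - r + 1))

def matching_columns(r, t):
    """All (m1, m2, s) with identical nonzero patterns, where s = m2 - m1."""
    pats = [column_pattern(r, t, m) for m in range(t + 1)]
    return [(m1, m2, m2 - m1)
            for m1 in range(t + 1)
            for m2 in range(m1 + 1, t + 1)
            if any(pats[m1]) and pats[m1] == pats[m2]]

# Each reported pair yields alpha^s as the quotient q_{0,m1} / q_{0,m2}.
print(matching_columns(4, 7))
```

For this choice every matching pair has $s = 4$, which agrees with the observation that a suitable $s$ is a power of two.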
For practical applications, the maximum order of the roots of $Q(0, y)$ does not exceed eight. By exhaustively listing the binomial coefficient patterns, it can be observed that there exists at least one pair of coefficients that can be used to compute $\alpha^s$ whenever the condition listed in Table 3 is satisfied. In addition, a possible value of $s$, which applies to all possible $t - r$ for a given $r$, is provided in Table 3. For example, for a given $r$, as long as the degree of the remaining factor of $Q(0, y)$ stays within the bound listed in Table 3, there exists at least one pair of coefficients of $Q(0, y)$ whose quotient equals the listed power of $\alpha$. It can also be observed from Table 3 that $s$ always equals some non-negative integer power of two. In this case, $\alpha$ can be computed from $\alpha^s$ by a cyclic shift if the normal basis representation of finite field elements is employed.
The maximum order of the roots of $Q(0, y)$ in the first iteration level is unknown. The only information we have is that it is close to the degree of $Q(0, y)$ with high probability. Therefore, we start by predicting that $Q(0, y)$ has a single root with order $r = t$, and compute the root by (2). In the case of prediction failure, we make a second prediction, $r = t - 1$, and compute the root by dividing proper coefficients of $Q(0, y)$. In the case that the second prediction fails, $r$ is decreased by one again. This process can be carried out iteratively until a real root is found or the condition in Table 3 is no longer satisfied. The correctness of the predictions can be checked by observing the coefficients of $Q(0, y + \gamma)$, where $\gamma$ denotes the computed root.
If a real root is found through the above iterative prediction process, it must be the root with the highest order. However, $Q(0, y)$ may have other roots of lower order, all of which need to be found in the factorization algorithm. Assume a root, $\gamma$, with order $r$ is directly computed, so that $Q(0, y) = (y + \gamma)^r h(y)$, where $h(y)$ is the remaining factor of degree $t - r$. Then the coefficients of the shifted factor $h(y + \gamma)$ can be directly observed from the output of the F3 step. Accordingly, $h(y + \gamma) = 0$ can be solved instead of $h(y) = 0$. Adding $\gamma$ back to the computed roots, the roots of $h(y) = 0$ can be derived. The roots of polynomials with degree higher than two cannot be easily computed. In addition, from Fig. 2, the probability for the maximum root order to be larger than or equal to $t - 2$ is high. Therefore, we limit the iterative prediction process to $r \ge t - 2$. Another advantage of stopping the process at this point is that, as can be observed from Table 3, it is guaranteed that $\alpha^s$ can be directly computed from the coefficients of $Q(0, y)$ when $t - r \le 2$. In the case that all predictions fail, exhaustive search is employed to find the roots of $Q(0, y)$. In summary, the root computation for the first iteration level of the factorization can be carried out according to the iterative prediction procedure illustrated in Fig. 3. In this figure, the RCBY1 and RCBY2 blocks implement the root computation for $h(y)$ when its degree is one and two, respectively.
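The overall procedure (predict $r = t$, then $t - 1$, then $t - 2$, and fall back to exhaustive search) can be sketched in software as follows. This is a simplified model under stated assumptions, not the hardware flow of Fig. 3: it works in GF(2^4), verifies a prediction by repeated division instead of inspecting the shifted polynomial, and uses a small search on the remaining low-degree factor as a stand-in for the RCBY1 and RCBY2 blocks. All names are hypothetical.

```python
from math import comb

M = 4  # GF(2^M); x^4 + x + 1 is an illustrative primitive polynomial
EXP, LOG = [0] * 30, [0] * 16
v = 1
for i in range(15):
    EXP[i] = EXP[i + 15] = v
    LOG[v] = i
    v <<= 1
    if v & 0b10000:
        v ^= 0b10011

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_div(a, b):  # callers guarantee b != 0
    return 0 if a == 0 else EXP[LOG[a] - LOG[b] + 15]

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out

def poly_eval(q, x):
    acc = 0
    for c in reversed(q):
        acc = gf_mul(acc, x) ^ c
    return acc

def search(q):
    """Exhaustive search over all 16 field elements."""
    return [x for x in range(16) if poly_eval(q, x) == 0]

def strip_root(q, a):
    """Divide out (y + a) as often as possible; return (order, cofactor)."""
    order = 0
    while len(q) > 1:
        quo, carry = [0] * (len(q) - 1), 0
        for i in range(len(q) - 1, 0, -1):      # synthetic division
            carry = q[i] ^ gf_mul(carry, a)
            quo[i - 1] = carry
        if q[0] ^ gf_mul(carry, a):             # nonzero remainder: stop
            break
        q, order = quo, order + 1
    return order, q

def candidate(q, r):
    """Root candidate for predicted order r from one pair of matching columns."""
    t = len(q) - 1
    pat = [tuple(comb(r, m - j) & 1 if 0 <= m - j <= r else 0
                 for j in range(t - r + 1)) for m in range(t + 1)]
    for m2 in range(t, 0, -1):
        for s in (1, 2, 4, 8):
            m1 = m2 - s
            if m1 >= 0 and q[m2] and any(pat[m2]) and pat[m1] == pat[m2]:
                root = gf_div(q[m1], q[m2])     # equals alpha^s if prediction holds
                for _ in range((M - s.bit_length() + 1) % M):
                    root = gf_mul(root, root)   # s-th root by repeated squaring
                return root
    return None

def first_level_roots(q):
    t = len(q) - 1
    for r in range(t, max(t - 3, 0), -1):       # predict r = t, t-1, t-2
        a = candidate(q, r)
        if a is not None:
            order, rest = strip_root(q, a)
            if order == r:                      # prediction confirmed
                return sorted([a] + search(rest)), "predicted"
    return search(q), "exhaustive"

# (y + 2)^4 (y + 3)(y + 9): the third prediction (r = t - 2) succeeds.
q = poly_mul(poly_mul([3, 1], [9, 1]), [1])
for _ in range(4):
    q = poly_mul(q, [2, 1])
print(first_level_roots(q))
```

When no root has an order of at least $t - 2$, all three predictions are rejected by the order check and the sketch falls back to the full search, mirroring the last branch of the procedure.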
VLSI ARCHITECTURE FOR THE FACTORIZATION STEP
This section first presents detailed architectures for the computation units required by the proposed first iteration root computation scheme. Then we demonstrate how to incorporate these units into the fast factorization architecture proposed in [9] to further reduce the latency. Employing the iterative prediction scheme, it only takes one inversion and one multiplication to compute a root in the Comp Root block for each prediction. As mentioned in the previous section, there may exist more than one pair of coefficients that can be used for root computation. The pairs of coefficients can be carefully selected such that the divisors in these pairs remain the same through the three predictions as much as possible. In this case, the inversions required in later predictions can be saved. For example, if $\deg(Q(0, y)) = 7$, then in the first prediction, $\gamma$ can be computed as $q_{0,6}/q_{0,7}$. In the case that this prediction fails, we predict $r = 6$, and $\gamma^2$ can be computed as $q_{0,5}/q_{0,7}$. Similarly, if $r = 5$, $\gamma^4$ can be computed as $q_{0,3}/q_{0,7}$. Therefore, by choosing the pairs of coefficients as above, the complicated inversion can be saved in the latter two predictions. For a degree one polynomial $h(y) = h_1 y + h_0$, its root can be computed as $h_1^{-1} h_0$. Hence, the Comp Root and RCBY1 blocks can be implemented by similar architectures. In addition, as can be observed from Fig. 3, the Comp Root and RCBY1 blocks are activated at different times. Accordingly, the Comp Root architecture illustrated in Fig. 4 can be used to implement both the Comp Root and RCBY1 blocks.
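The saving from a shared divisor can be illustrated in software. The sketch below assumes a degree-7 $Q(0, y)$ and the coefficient pairs (6, 7), (5, 7) and (3, 7), which follow from matching binomial patterns for predicted orders 7, 6 and 5; the field GF(2^4) and the function names are assumptions made for illustration.

```python
M = 4  # GF(2^M); x^4 + x + 1 is an illustrative primitive polynomial
EXP, LOG = [0] * 30, [0] * 16
v = 1
for i in range(15):
    EXP[i] = EXP[i + 15] = v
    LOG[v] = i
    v <<= 1
    if v & 0b10000:
        v ^= 0b10011

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    return EXP[15 - LOG[a]]

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out

def root_pow2(value, s):
    """s-th root for s = 2^l, computed with M - l squarings."""
    for _ in range((M - s.bit_length() + 1) % M):
        value = gf_mul(value, value)
    return value

def shared_divisor_candidates(q):
    """Root candidates for predicted orders 7, 6 and 5 of a degree-7 q(y)."""
    if len(q) != 8 or q[7] == 0:
        raise ValueError("expected a degree-7 polynomial")
    inv = gf_inv(q[7])                          # the only inversion
    return (gf_mul(q[6], inv),                  # r = 7: quotient is the root
            root_pow2(gf_mul(q[5], inv), 2),    # r = 6: quotient is root^2
            root_pow2(gf_mul(q[3], inv), 4))    # r = 5: quotient is root^4

# 3 * (y + 9)^5 * (y^2 + 2y + 6): only the third candidate is the true root.
q = [3]
for _ in range(5):
    q = poly_mul(q, [9, 1])
q = poly_mul(q, [6, 2, 1])
print(shared_divisor_candidates(q)[2])
```

Because all three quotients share the divisor $q_{0,7}$, the inverse computed for the first prediction can simply be stored and reused, leaving one multiplication per later prediction.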
In the Comp Root architecture, depending on whether the divisor changes in the three predictions, the inverse value can either be computed by the Inv block or be read from the register. The function of the Fract Power block is to compute the root from the value of its power by a cyclic shift. This block also includes the conversion to and from the normal basis representation. As discussed in the previous section, it is more efficient to compute the root of $h(y + \gamma)$ instead of $h(y)$. Hence, the coefficients fed into this architecture for the RCBY1 implementation are actually proper coefficients of $Q(0, y + \gamma)$. Accordingly, $\gamma$ needs to be added back to the output of the Fract Power block to derive the root of $h(y)$.
The roots of a degree two polynomial can be computed directly by the algorithm developed in [1]. According to this algorithm, the RCBY2 block can be implemented by the architecture illustrated in Fig. 5. Similarly, the root computation is actually carried out on $h(y + \gamma)$, and it makes use of a pre-computed matrix. For detailed explanations, the interested reader is referred to [9]. The top left part of this architecture is the same as the MRC2 block in the fast factorization architecture proposed in [9]. The RCBY2 block is only used in the first iteration root computation, while the MRC2 block is used from the second iteration level onward. Therefore, the RCBY2 architecture can be shared by both blocks. Accordingly, two multiplexors are added to the 'roota' and 'rootb' outputs, as illustrated at the bottom of Fig. 5, to enable resource sharing.
Incorporating the iterative prediction scheme for the first iteration level root computation, the factorization algorithm can be implemented by the architecture illustrated in Fig. 6. Compared to the fast factorization architecture in [9], the only differences are the shaded blocks: the Comp Root block replaces one of the RC3 blocks, the RCBY2 block replaces the MRC2 block, and two multiplexors are added. In addition, the Control block is slightly different and some connections are added. The PU unit implements the F3 step, and the PS unit implements the F4 and F1 steps. A detailed description of how the fast factorization architecture works can be found in [9]. Compared to the RC3 and MRC2 architectures, the Comp Root and RCBY2 architectures illustrated in Fig. 4 and Fig. 5 require four extra adders, three 2:1
The factorization architecture is pipelined, and the critical path consists of 10 XOR gates and 1 AND gate. Table 4 summarizes the number of pipelining stages each block in , È Ê ½ ±, È Ê ¾ ± and È Ê ¿ ¿ ± at Ê ½ ¼ ¿
In this case, the first iteration root computation takes an average of 42 clock cycles. Compared to the exhaustive search approach, which requires 256+3 clock cycles out of the overall latency of 823 clock cycles, the proposed iterative prediction scheme brings a speedup of 36%. Since the prediction failure rates decrease with FER, the speedup that can be brought by this scheme will be more significant for applications requiring lower FER.
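The reported speedup can be reproduced from the quoted cycle counts. The short calculation below assumes that only the first iteration root computation changes (from 256+3 cycles to an average of 42) while the rest of the 823-cycle latency stays the same; the function name `speedup` is illustrative.

```python
def speedup(total, first_level_old, first_level_new):
    """Return the new total latency and the relative speedup."""
    new_total = total - first_level_old + first_level_new
    return new_total, total / new_total - 1

new_total, gain = speedup(823, 256 + 3, 42)
share = (256 + 3) / 823  # fraction of the latency spent in the exhaustive search
print(new_total, round(100 * gain), round(100 * share))
```

Under this assumption the average latency drops to 606 clock cycles, which corresponds to the quoted speedup of about 36%, and the exhaustive search share of roughly 31% matches the figure given earlier for the (255, 239) code.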
CONCLUSION
A novel iterative prediction scheme is proposed in this paper to speed up the first iteration root computation in the factorization step of the KV algorithm. This scheme carries out up to three trial-and-error direct root computations before exhaustive search is employed. In addition, all the computation units required for this scheme can be shared with the blocks that already exist in the fast factorization architecture. Hence, the extra area required by incorporating this scheme is negligible. The speedup that can be brought by the proposed scheme is heavily dependent on the code rate, code length, maximum interpolation multiplicity and channel model. Further study of codes with different settings needs to be carried out to optimize this scheme.
