High-speed VLSI Architecture for Low-complexity Chase Soft-decision Reed-Solomon Decoding by Xinmiao Zhang
Case Western Reserve University
Email: xinmiao.zhang@case.edu
Abstract—Interpolation-based algebraic soft-decision decoding
(ASD) of Reed-Solomon (RS) codes can achieve signiﬁcant
coding gain with polynomial complexity. Among available ASD
algorithms, the low-complexity Chase (LCC) algorithm can
achieve a good performance-complexity tradeoff. In addition,
the multiplicity of each interpolation point involved in this
algorithm is one. These features make the LCC decoding very
attractive for practical hardware implementation. In this paper,
we present an efﬁcient and high-speed VLSI architecture for the
implementation of the LCC decoder. ASD algorithms have two
major steps: interpolation and factorization. The high efﬁciency
of the LCC interpolation architecture is achieved by employing
a backward interpolation technique, which enables the sharing
of intermediate interpolation results. We also show that the
factorization step can be eliminated in the case of LCC decoding.
From critical path and latency analysis, the LCC decoder can
achieve a throughput of several gigabits per second in ASIC
implementations. In addition, the LCC decoder requires less than
three times the area of a hard-decision decoder that has the same
throughput.
I. INTRODUCTION
Reed-Solomon (RS) codes are used as error-correcting
codes in many applications, such as computer hard drives,
wireless and optical communications, and deep-space prob-
ing. Currently, the hard-decision Berlekamp-Massey algorithm
(BMA) [1] is employed in practical systems to decode RS
codes, due to the existence of very high speed hardware
implementations. However, the BMA can only correct errors
up to half the minimum distance of the code. Extensive
research has been carried out on soft-decision RS decoding
algorithms, which can correct more errors by making use
of the reliability information from the channel. Nevertheless,
previous soft-decision decoding algorithms either can only
achieve limited coding gain or have very high complexity.
Algebraic soft-decision (ASD) decoding algorithms [2], [3],
[4], [5], [6], [7] for RS codes have been developed recently. By
incorporating the probability information from the channel into
the algebraic interpolation process developed by Sudan and
Guruswami [8], [9], these algorithms can achieve signiﬁcant
coding gain with a complexity that is polynomial with respect
to the codeword length.
This work is supported by NSF under grants 0846331 and 0835782.
ASD algorithms consist of three steps: multiplicity assignment, interpolation and factorization. The multiplicity assignment step affects the overall error-correcting performance and complexity of the ASD algorithm. For the purpose of practical
implementation, simple multiplicity assignment schemes are
preferred. The multiplicity assignment in the Kotter-Vardy
(KV) algorithm [2] can be implemented by constant multi-
plications followed by the ﬂoor function, and those in the
low-complexity Chase (LCC) [6] and bit-level generalized
minimum distance decoding (BGMD) [7] algorithms can be
implemented by comparators. Smaller multiplicities translate
to lower complexity in the interpolation and factorization
steps. On the other hand, smaller multiplicities do not always
lead to inferior error-correcting performance. For example,
with multiplicity one and eight test decoding, the LCC algo-
rithm can achieve similar or higher coding gain than the KV
algorithm with maximum multiplicity four for a (255, 239)
RS code [10].
In this paper, we present a high-speed VLSI architecture
for the implementation of the LCC decoder. For an (n,k) RS
code, the LCC algorithm carries out test decoding on multiple
vectors of n interpolation points with multiplicity one. The
re-encoding and coordinate transformation techniques [11],
[12] can be applied to exclude k points from the interpolation
process. Nevertheless, if the interpolation is carried out on each
test vector from scratch, the extra complexity of interpolating
over multiple vectors may offset the savings brought by the
small multiplicity. Fortunately, the test vectors in the LCC
decoding share common points. The backward interpolation
scheme [13] can be employed, such that the interpolation result
of a vector can be derived from that of another by taking
care of only the points that are different between the two
vectors. The factorization can still be carried out directly on
the interpolation output when the re-encoding and coordinate
transformation are applied. In this case, although a hard-
decision decoding post-processing is required to recover the
actual errors, the number of iterations that need to be carried
out in the factorization can be substantially reduced. The
factorization architecture can be further simpliﬁed when the
multiplicity is one. However, it still accounts for a signiﬁcant
part of the overall area of the LCC decoder. Recently, it was
discovered that the error locations and magnitudes can be
computed directly from the interpolation output in the case
of LCC decoding [14]. As a result, the factorization step and
the key-equation solver in the post-processing hard-decision
decoding can be eliminated. The details of the factorization-free LCC decoder employing the backward interpolation are presented in this paper. In addition, the complexities of the
LCC and BMA decoders are compared.
The structure of this paper is as follows. Section II intro-
duces the LCC decoding, and the re-encoding and coordinate
transformation techniques. Section III presents the backward
interpolation and corresponding architectures. How to derive
the error locations and magnitudes without factorization is
described in Section IV. Section V presents comparisons of the
LCC and BMA decoders. Conclusions are provided in Section
VI.
II. LCC DECODING, RE-ENCODING AND COORDINATE
TRANSFORMATION
Without loss of generality, RS codes constructed over $GF(2^q)$ ($q \in Z^+$) are considered in this paper. For a primitive $(n,k)$ code, $n = 2^q - 1$. The encoding of RS codes can be done by considering the $k$ message symbols $f_0, f_1, \cdots, f_{k-1}$ as the coefficients of the message polynomial $f(x) = f_0 + f_1 x + \cdots + f_{k-1}x^{k-1}$, and then evaluating $f(x)$ at $n$ distinct nonzero elements of $GF(2^q)$. Assuming that $\alpha_0, \alpha_1, \cdots, \alpha_{n-1}$ are the evaluation elements in a fixed order, the codeword corresponding to the message $f = (f_0, f_1, \cdots, f_{k-1})$ is $c = (f(\alpha_0), f(\alpha_1), \cdots, f(\alpha_{n-1}))$. According to this evaluation mapping encoding, the message polynomial can be recovered by interpolating over the points $(\alpha_0, f(\alpha_0)), (\alpha_1, f(\alpha_1)), \cdots, (\alpha_{n-1}, f(\alpha_{n-1}))$. However, the codeword might be corrupted during transmission. Given the observation of the received symbol at the $j$th code position, the associated interpolation points can include $(\alpha_j, \omega)$ for any $\omega \in GF(2^q)$. ASD algorithms put higher weight on the more reliable points during the interpolation in order to increase the probability that the correct message polynomial can be recovered.
ASD algorithms consist of three steps: multiplicity assign-
ment, interpolation and factorization. They are different in the
multiplicity assignment step, and share the same interpolation
and factorization steps. The multiplicity assignment decides
the interpolation points and their multiplicities by making
use of the reliability information from the channel. This step
affects not only the error-correcting performance of the ASD
algorithm, but also the complexity of the following two steps.
In the LCC decoding, there are $2^\eta$ ($\eta \in Z^+$, $\eta < n-k$) test vectors of $n$ interpolation points. Although each point has multiplicity one, the reliability information is incorporated in the choice of the interpolation points. Each of the $\eta$ most unreliable code positions is assigned two interpolation points: $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$, where $\beta_j$ is the hard decision of the $j$th received symbol and $\beta'_j$ differs from $\beta_j$ in only the least reliable bit. For each of the remaining $n - \eta$ code positions, only one interpolation point, $(\alpha_j, \beta_j)$, is assigned. The test vectors are formed by picking one interpolation point for each code position. Since there are two possible points for each unreliable code position, the total number of test vectors is $2^\eta$.
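The construction of the test vectors can be illustrated with a minimal Python sketch. The function names, the dictionary that maps each unreliable position to the index of its least reliable bit, and the toy symbol values are all hypothetical; only the $y$-coordinates of the points are generated, since the $x$-coordinate of position $j$ is always $\alpha_j$.

```python
from itertools import product

def flip_bit(symbol, bit):
    """Return beta'_j: the hard decision with one bit inverted."""
    return symbol ^ (1 << bit)

def lcc_test_vectors(hard, least_reliable_bit):
    """hard: hard-decision symbols beta_j, one per code position.
    least_reliable_bit: {position j: bit index} for the eta unreliable positions
    (a hypothetical input format chosen for this sketch)."""
    positions = sorted(least_reliable_bit)
    vectors = []
    # One binary choice per unreliable position gives 2^eta test vectors.
    for choice in product((False, True), repeat=len(positions)):
        vec = list(hard)
        for use_alt, j in zip(choice, positions):
            if use_alt:
                vec[j] = flip_bit(hard[j], least_reliable_bit[j])
        vectors.append(vec)
    return vectors

# Toy example: four positions, of which positions 1 and 3 are unreliable.
vectors = lcc_test_vectors([5, 9, 3, 7], {1: 0, 3: 2})
```

With $\eta = 2$ unreliable positions the sketch returns four vectors, and the reliable positions carry the same symbol in every vector.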
The multiplicity assignment in the LCC decoding can be implemented by comparators. In contrast, the interpolation and factorization steps are much more hardware-demanding. In addition, the interpolation and factorization need to be carried out on each test vector in the LCC decoding. The function of the interpolation is to find a polynomial $Q(x,y)$ of minimum $(1,k-1)$ weighted degree that passes each interpolation point with its associated multiplicity. Then the factorization step computes all factors of $Q(x,y)$ in the form of $y - f(x)$ with the degree of $f(x)$ less than $k$. The computed $f(x)$ form a list of message polynomials. A bivariate polynomial $Q(x,y)$ is said to pass a point $(\alpha,\beta)$ with multiplicity $m$ if $Q(x+\alpha, y+\beta)$ contains a monomial $x^a y^b$ with degree $a + b = m$, but does not contain any monomial with degree less than $m$. The $(w_x, w_y)$ weighted degree of a bivariate polynomial $Q(x,y) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} q_{i,j}x^i y^j$ is defined as the maximum of $iw_x + jw_y$ such that $q_{i,j} \neq 0$.
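The weighted degree definition can be made concrete with a short Python sketch. The dictionary representation of $Q(x,y)$, which maps exponent pairs $(i,j)$ to coefficients $q_{i,j}$, is an assumption made for illustration.

```python
def weighted_degree(Q, wx, wy):
    """(wx, wy)-weighted degree of Q, given as {(i, j): coefficient of x^i y^j}.
    Monomials with a zero coefficient are ignored, as in the definition."""
    return max(i * wx + j * wy for (i, j), c in Q.items() if c != 0)

# Toy polynomial 1 + 2x^3 + 5xy, with an explicit zero coefficient for x^7.
Q = {(0, 0): 1, (3, 0): 2, (1, 1): 5, (7, 0): 0}
k = 4
wdeg = weighted_degree(Q, 1, k - 1)   # the monomial xy has weight 1 + (k - 1) = 4
```

For $k = 4$ the monomial $xy$ outweighs $x^3$, so the $(1,3)$ weighted degree of this toy polynomial is 4.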
The complexity of the LCC and other ASD algorithms can be reduced by applying the re-encoding and coordinate transformation techniques [11], [12]. The basic idea of the re-encoding is to first pick the $k$ most reliable code positions in the received word $r$, and denote them by the set $R$. Then an erasure decoding is applied to the $k$ symbols of $r$ with index in $R$ to derive another codeword $\phi$. Since $\bar{c} = c + \phi$ is also a codeword, the error vector, $e$, of the codeword $c$ can be found by decoding $\bar{r} = r + \phi = c + e + \phi = \bar{c} + e$ instead. The advantage of decoding $\bar{r}$ is that the $k$ symbols in $\bar{r}$ with index in $R$ are zero. Accordingly, the interpolation over these points can be pre-solved as $\prod_{i\in R}(x + \alpha_i)$ and the expensive bivariate interpolation only needs to be carried out on the points in the remaining $n-k$ code positions. In addition, coordinate transformation can be applied to factor out $\prod_{i\in R}(x + \alpha_i)$, which is now a common term of all polynomials involved in the bivariate interpolation. As a result, the length of the polynomials and the memory requirement of the interpolation can also be reduced.
The factorization can still be applied directly to the inter-
polation output when the re-encoding and coordinate transfor-
mation are employed. In this case, the actual errors in the k
most reliable code positions can be recovered after a hard-
decision decoding, such as BMA, in which the factorization
outputs are used as syndromes. If τ errors need to be corrected
in the k most reliable code positions, 2τ syndromes need to
be computed from the factorization. The number of iterations in the factorization equals the number of symbols that need to be computed. Originally, $k$ symbols need to be computed as the coefficients for each $f(x)$ factor in the factorization. Since $2\tau$ can be set to a number that is much smaller than $k$, the complexity of the factorization can also be significantly reduced as a result of re-encoding and coordinate transformation, despite the extra hard-decision decoding. After the errors in the $k$
most reliable code positions have been corrected, an erasure
decoding can be applied to recover the transmitted codeword
c. This decoding process is illustrated in Fig. 1 [14].
[Fig. 1. The re-encoded and transformed LCC decoder. Blocks: multiplicity assignment, re-encoder, interpolation, factorization, BMA, erasure decoder.]
Besides the re-encoding and coordinate transformation, other techniques and architectures have been proposed to reduce the complexity of the interpolation [15], [16], [17], [18], [19], [20]. These architectures can be further simplified in the case of multiplicity one. However, in the LCC decoding,
the interpolation needs to be carried out on each test vector.
Starting the interpolation for each vector from scratch may
offset the savings brought by the small multiplicity. Two
interpolation algorithms can be employed for practical implementations: Nielson's algorithm [21], [22] and the Lee-O'Sullivan (LO) algorithm [23]. Although the LO algorithm has lower complexity when the maximum multiplicity is less than three [20], it does not allow interpolation points and their multiplicities to be changed once the interpolation has started. Hence, the point-by-point Nielson's algorithm is employed in
our LCC decoder design in order to enable the sharing of
intermediate interpolation results. A backward interpolation
scheme for the LCC decoding has been proposed in [13] to
eliminate points from given interpolation results. Employing
this scheme, the interpolation over the second and later test
vectors only needs to take care of the points that are different
from the previous vector. Section III presents the details of
this scheme.
The factorization problem can be solved by using the
iterative algorithm proposed by Roth and Ruckenstein [24].
When the y-degree of Q(x,y) is larger than two, the bot-
tleneck of this algorithm lies in the exhaustive-search-based
root computation over ﬁnite ﬁelds required in each iteration.
Several architectures have been proposed to increase the speed
of the root computation and factorization [25], [26], [27],
[28]. In the case of LCC decoding, the y-degree of Q(x,y)
is one. Accordingly, the root computation only needs to be
done for degree one polynomials. Hence it is no longer a
bottleneck. A selection technique has been proposed in [6]
to pass the interpolation output of only one test vector to the
factorization step at the cost of small performance degradation.
Nevertheless, the factorization architecture still accounts for
a significant proportion of the overall decoder area. It was discovered in [14] that the factorization and the key equation solver in the following hard-decision decoder can actually be eliminated in the case of LCC decoding. This scheme will be
detailed in Section IV.
III. BACKWARD INTERPOLATION
The test vectors in the LCC decoding can be ordered such that adjacent vectors differ in only one pair of points, and the differing points are in the form of $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$. If $(\alpha_j, \beta_j)$ can be eliminated from the interpolation result of the current vector, then the interpolation result for the next vector can be derived by adding $(\alpha_j, \beta'_j)$ using Nielson's algorithm. In this case, the interpolation for the second and later test vectors only needs to take care of the differing points. Accordingly, significant computation reduction can be achieved. Eliminating points from a given interpolation result is referred to as backward interpolation, while adding points using Nielson's algorithm is called forward interpolation. The backward interpolation is built upon Nielson's algorithm. Hence, Nielson's algorithm is described first in the following.
Algorithm A: Nielson's Interpolation
initialization:
    $Q^{(0)}(x,y) = 1, Q^{(1)}(x,y) = y, \cdots, Q^{(t)}(x,y) = y^t$
    $Wdeg_0 = 0, Wdeg_1 = k-1, \cdots, Wdeg_t = t(k-1)$
interpolation starts:
for each interpolation point $(\alpha,\beta)$ with multiplicity $m$
    for $a = 0$ to $m-1$ and $b = 0$ to $m-a-1$
        A1: compute $d^{(l)}_{a,b}(\alpha,\beta)$, $(0 \le l \le t)$
        A2: $u = \arg\min_l (Wdeg_l \mid d^{(l)}_{a,b}(\alpha,\beta) \neq 0, 0 \le l \le t)$
        for $l = 0$ to $t$, $l \neq u$
            A3: $Q^{(l)}(x,y) \Leftarrow d^{(u)}_{a,b}(\alpha,\beta)Q^{(l)}(x,y) + d^{(l)}_{a,b}(\alpha,\beta)Q^{(u)}(x,y)$
        A4: $Q^{(u)}(x,y) \Leftarrow Q^{(u)}(x,y)(x+\alpha)$
            $Wdeg_u \Leftarrow Wdeg_u + 1$
Output: $Q^{(\varphi)}(x,y)$ ($\varphi = \arg\min_l (Wdeg_l \mid 0 \le l \le t)$)
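In Algorithm A, the discrepancy coefficient $d^{(l)}_{a,b}(\alpha,\beta)$ is the coefficient of $x^a y^b$ in $Q^{(l)}(x+\alpha, y+\beta)$. It can be computed as
$$d^{(l)}_{a,b}(\alpha,\beta) = \sum_{r \ge a}\sum_{s \ge b}\binom{r}{a}\binom{s}{b}\alpha^{r-a}\beta^{s-b}q^{(l)}_{r,s}.$$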
The point-by-point Nielson's interpolation algorithm first initializes a set of $t+1$ candidate polynomials, where $t$ is determined by the total number of interpolation constraints and the lexicographical order of monomials according to the $(1,k-1)$ weighted degree. For high-rate codes, $t$ equals the maximum interpolation multiplicity. The interpolation constraints are satisfied one after another. In the iteration for constraint $(a,b)$ of point $(\alpha,\beta)$, the discrepancy coefficient $d^{(l)}_{a,b}(\alpha,\beta)$ is computed for each candidate polynomial. If this coefficient is zero, the constraint $(a,b)$ of point $(\alpha,\beta)$ is already satisfied by the corresponding polynomial. Otherwise, the polynomials are updated in steps A3 and A4 to force these coefficients to zero. In addition, the polynomial updating does not affect the interpolation constraints that have already been satisfied in previous iterations. At the end of each iteration, the candidate polynomials form a Gröbner basis of the module consisting of polynomials with maximum $y$-degree $t$ that satisfy all previously covered interpolation constraints. Since the polynomial with minimum weighted degree in the Gröbner basis is the polynomial with minimum weighted degree in the module, the desired interpolation output can be found after the iterations for all constraints are carried out. When the re-encoding and coordinate transformation are applied, the polynomials can be initialized in the same way. However, the $(1,-1)$ weighted degree should be used due to the coordinate transformation.
In the LCC decoding, the maximum interpolation multiplicity is one. Hence, there are only two polynomials involved in the interpolation and the maximum $y$-degree of these polynomials is one. In other words, the polynomials in the Gröbner basis can be expressed as $Q^{(l)}(x,y) = q_0^{(l)}(x) + q_1^{(l)}(x)y$ ($l \in \{0,1\}$). In order to eliminate a point, $(\alpha,\beta)$, from a given Gröbner basis, we want to reverse the computations that have been carried out during the interpolation over this point. A point of multiplicity one only requires one iteration in the interpolation. It can be observed from the A3 and A4 steps of Algorithm A that during the interpolation over $(\alpha,\beta)$, the minimum polynomial $Q^{(u)}(x,y)$ (the polynomial with minimum weighted degree and nonzero discrepancy) is multiplied by $(x+\alpha)$, and the other polynomial is replaced by a linear combination of itself and the minimum polynomial. The A3 step is developed based on the property that a polynomial in a Gröbner basis can be replaced by a linear combination of itself and another basis polynomial, if the weighted degree does not change. To eliminate $(\alpha,\beta)$, $(x+\alpha)$ needs to be divided from the minimum polynomial and a reverse linear combination needs to be applied to the other polynomial. The reverse of a linear combination is another linear combination. Therefore, based on the same property of the Gröbner basis, the linear combination in the A3 step does not need to be reversed in order to eliminate $(\alpha,\beta)$ from the Gröbner basis, as long as $(x+\alpha)$ is divided from the minimum polynomial. In addition, the division does not affect other interpolation points. Accordingly, after the division, the polynomials form a Gröbner basis that passes all points except $(\alpha,\beta)$. Note that this Gröbner basis may not be the same as the one that would have been derived by carrying out the interpolation over all points except $(\alpha,\beta)$ using Nielson's algorithm. However, they are Gröbner bases of the same module. The polynomial with the minimum weighted degree must appear in any Gröbner basis of the module and is unique up to a nonzero constant scalar [29].
The minimum polynomial in the interpolation iteration for the point $(\alpha,\beta)$ may not be the minimum polynomial in later interpolation iterations. It can therefore be updated by linear combinations, and the factor $(x+\alpha)$ may be lost by the end of the interpolation over all points. Hence, we first need to check whether the polynomials in the Gröbner basis at the end have the factor $(x+\alpha)$. It can be derived that $Q^{(l)}(x,y)$ has the factor $(x+\alpha)$ iff $q_1^{(l)}(\alpha) = 0$. In addition, it has been proved that the basis polynomials cannot all have the factor $(x+\alpha)$ [10]. It is possible that none of the basis polynomials has the factor. In this case, assuming $u = \arg\min_l (Wdeg_l \mid q_1^{(l)}(\alpha) \neq 0)$ and $v = \{0,1\}\setminus u$, an equivalent Gröbner basis that still passes all interpolation points can be constructed as
$$\begin{cases} Q^{(v)}(x,y) \Leftarrow q_1^{(v)}(\alpha)Q^{(u)}(x,y) + q_1^{(u)}(\alpha)Q^{(v)}(x,y) \\ Q^{(u)}(x,y) \Leftarrow Q^{(u)}(x,y) \end{cases} \quad (1)$$
It has been proved that the updated $Q^{(v)}(x,y)$ in (1) contains the factor $(x+\alpha)$. When there is one basis polynomial that contains $(x+\alpha)$, this factor can be divided from the polynomial to form a Gröbner basis that passes all interpolation points except $(\alpha,\beta)$. In summary, the backward interpolation to eliminate a point $(\alpha,\beta)$ of multiplicity one from a given interpolation result can be described by the pseudocode in Algorithm B.
Algorithm B: Backward Interpolation for the LCC Decoding
B1: compute $q_1^{(l)}(\alpha)$ for $l = 0,1$
B2: $u = \arg\min_l (Wdeg_l \mid q_1^{(l)}(\alpha) \neq 0)$, $v = \{0,1\}\setminus u$
B3: $Q^{(v)}(x,y) \Leftarrow q_1^{(u)}(\alpha)Q^{(v)}(x,y) + q_1^{(v)}(\alpha)Q^{(u)}(x,y)$
B4: divide $Q^{(v)}(x,y)$ by $(x+\alpha)$
    $Wdeg_v \Leftarrow Wdeg_v - 1$
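Algorithms A and B, specialized to multiplicity one, can be sketched in software as follows. This is an illustrative model rather than the hardware design: it assumes $GF(2^8)$ with the primitive polynomial $x^8+x^4+x^3+x^2+1$ (0x11D), omits the re-encoding and coordinate transformation, stores each $Q^{(l)}$ as a pair of coefficient lists $[q_0, q_1]$ with the lowest degree first, and breaks weighted-degree ties by index. All function names and the toy points are hypothetical.

```python
def gf_mul(a, b, prim=0x11D):
    """Multiply in GF(2^8); the primitive polynomial is an assumption."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r

def ev(p, x):
    """Horner evaluation of p (coefficients, lowest degree first) at x."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc

def add(p, q):
    n = max(len(p), len(q))
    p, q = p + [0] * (n - len(p)), q + [0] * (n - len(q))
    return [a ^ b for a, b in zip(p, q)]

def scale(p, s):
    return [gf_mul(c, s) for c in p]

def mul_linear(p, alpha):
    """p(x) * (x + alpha)."""
    return add([0] + p, scale(p, alpha))

def div_linear(p, alpha):
    """Exact division p(x) / (x + alpha) by synthetic division."""
    q, carry = [0] * max(len(p) - 1, 0), 0
    for i in range(len(p) - 1, 0, -1):
        carry = p[i] ^ gf_mul(carry, alpha)
        q[i - 1] = carry
    assert (p[0] if p else 0) ^ gf_mul(carry, alpha) == 0, "not divisible"
    return q

def combine(basis, u, v, cu, cv):
    """Q(v) <= cu*Q(v) + cv*Q(u), applied to q0 and q1 (steps A3 and B3)."""
    basis[v] = [add(scale(basis[v][c], cu), scale(basis[u][c], cv)) for c in (0, 1)]

def forward(basis, wdeg, alpha, beta):
    """One iteration of Algorithm A for a point of multiplicity one."""
    d = [ev(q0, alpha) ^ gf_mul(ev(q1, alpha), beta) for q0, q1 in basis]  # A1
    cands = [l for l in (0, 1) if d[l]]
    if not cands:
        return  # both polynomials already pass the point
    u = min(cands, key=lambda l: wdeg[l])                                  # A2
    v = 1 - u
    combine(basis, u, v, d[u], d[v])                                       # A3
    basis[u] = [mul_linear(c, alpha) for c in basis[u]]                    # A4
    wdeg[u] += 1

def backward(basis, wdeg, alpha):
    """Algorithm B: eliminate the multiplicity-one point with x-coordinate alpha."""
    e = [ev(q1, alpha) for _, q1 in basis]                                 # B1
    u = min((l for l in (0, 1) if e[l]), key=lambda l: wdeg[l])            # B2
    v = 1 - u
    combine(basis, u, v, e[u], e[v])                                       # B3
    basis[v] = [div_linear(c, alpha) for c in basis[v]]                    # B4
    wdeg[v] -= 1

# Toy run: interpolate over five points, then swap (3, 200) for (3, 201).
k = 3
basis, wdeg = [[[1], [0]], [[0], [1]]], [0, k - 1]   # Q(0) = 1, Q(1) = y
for a, b in [(1, 7), (2, 9), (3, 200), (4, 4), (5, 0)]:
    forward(basis, wdeg, a, b)
backward(basis, wdeg, 3)
forward(basis, wdeg, 3, 201)
```

After the swap, both basis polynomials pass the five points of the new test vector, and only two iterations were spent instead of a fresh interpolation over all points.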
In Algorithm B, the linear combination in step B3 is also applied when there is one basis polynomial that contains the factor $(x+\alpha)$ before the linear combination. In this case, $q_1^{(u)}(\alpha) \neq 0$ and $q_1^{(v)}(\alpha) = 0$. Hence the linear combination in step B3 reduces to a scalar multiplication, and leads to an equivalent Gröbner basis. The purpose of applying the linear combination in both cases is to reduce the complexity and latency of the control in hardware implementations. It can be observed that the computations in the backward interpolation are very similar to those in Nielson's forward interpolation. The univariate polynomial evaluation in step B1 can be computed as an intermediate result of the bivariate polynomial evaluation in step A1. In addition, the B3 and A3 steps are the same. Accordingly, computation units can be shared between these steps. The major difference between the backward and forward interpolation is that instead of multiplying the factor
$(x+\alpha)$ in the A4 step, this factor is divided in the B4 step.
[Fig. 2. The polynomial evaluation (PE) architecture. Inputs: $q_0^{(l)}(x)$, $q_1^{(l)}(x)$; outputs: $q_0^{(l)}(\alpha)$, $q_1^{(l)}(\alpha)$, $Q^{(l)}(\alpha,\beta)$; includes a zero detector.]
[Fig. 3. The polynomial updating (PU) architecture for the LCC decoding.]
Fig. 2 [10] shows a polynomial evaluation (PE) architecture that can carry out both the A1 and B1 steps for the LCC decoding. Applying Horner's rule, the two feedback loops
on the left compute univariate polynomial evaluation values.
They are shared to compute discrepancy coefﬁcients, which
are bivariate polynomial evaluation values in the case of LCC
decoding. The zero detector in the PE architecture can help
to decide the index of the minimum polynomial. In order to
keep the critical path no longer than one multiplier, one adder
and one multiplexor, the PE architecture is pipelined into three
stages.
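The sharing between steps A1 and B1 can be mirrored in a few lines of Python. This is a behavioral sketch only: it does not model the three pipeline stages, it assumes $GF(2^8)$ with the primitive polynomial 0x11D, and it expects $q_0$ and $q_1$ as equal-length coefficient lists with the lowest degree first. The function names are hypothetical.

```python
def gf_mul(a, b, prim=0x11D):
    """Multiply in GF(2^8); the primitive polynomial is an assumption."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r

def polynomial_evaluation(q0, q1, alpha, beta):
    """Return (q1(alpha), Q(alpha, beta)) for Q(x, y) = q0(x) + q1(x) y."""
    e0 = e1 = 0
    # Two Horner loops run in lockstep, one coefficient pair per step.
    for c0, c1 in zip(reversed(q0), reversed(q1)):
        e0 = gf_mul(e0, alpha) ^ c0
        e1 = gf_mul(e1, alpha) ^ c1
    # q1(alpha) is the value needed in step B1; the discrepancy for step A1
    # reuses it with one extra multiplication and addition.
    return e1, e0 ^ gf_mul(e1, beta)

# Toy example: Q(x, y) = (x + 2) + y evaluated at alpha = 2.
b1_value, a1_value = polynomial_evaluation([2, 1], [1, 0], 2, 5)
```

A zero test on either returned value plays the role of the zero detector when the index of the minimum polynomial is selected.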
The polynomial updating (PU) architecture shown in Fig. 3
[10] can implement the A3, A4, B3 and B4 steps. During
the forward interpolation, the bivariate polynomial evaluation
values are input to the A block. The linear combination at the
output of block A is passed through block B intact by choosing
’0’ as the coefﬁcient of the multiplier in block B. On the
other hand, the minimum polynomial is multiplied by (x+α)
by choosing α as the coefﬁcient of the multiplier in block
C. During the backward interpolation, univariate polynomial
evaluation values are used. In this case, the linear combination
at the output of block A needs to be divided by (x + α).
The division is done in block B by passing α through the
multiplexor to the multiplier. No computation needs to be
carried out on the other polynomial. This can be achieved
by passing ’0’ through the multiplexor in block C. The PU
architecture is also pipelined to limit the critical path to one
multiplier, one adder and one multiplexor.
TABLE I
ITERATION NUMBER AND MEMORY REQUIREMENT COMPARISON OF INTERPOLATION ARCHITECTURES FOR THE LCC DECODING
Architecture | # of iterations | memory requirement
Forward-only interpolation architecture (i) | $(n-k-\eta) + 2^\eta \times \eta$ | 2
Forward-only interpolation architecture (ii) | $(n-k-\eta) + 2(2^\eta - 1)$ | $2^{\eta-1}$
Backward-forward interpolation architecture | $(n-k) + 2(2^\eta - 1)$ | 1
In the LCC decoding, the forward interpolation can be
used to derive the interpolation output for the ﬁrst test vector.
After that, the interpolation for each of the second and later
test vectors only takes two iterations: one backward and one
forward. During this process, only one interpolation result
needs to be stored at any time. If only forward interpolation
is employed, either (i) the interpolation over the η unreliable
points needs to start from scratch for each test vector, or (ii)
intermediate interpolation results need to be stored after each
unreliable point is added. Table I lists the iteration number and
memory requirement comparisons of different interpolation
architectures. Since most of the computation units required
for the backward interpolation can be shared with those in the
forward interpolation, the area overhead of incorporating the
backward interpolation is very small. In addition, the same
critical path can be achieved in all interpolation architectures.
Hence, our backward-forward interpolation architecture can achieve either substantial speedup or area reduction compared to previous architectures. It has been reported in [10] that our LCC interpolation architecture for a (255, 239) code with $\eta = 3$ can achieve 48% higher efficiency in terms of speed/area ratio than the previous best design. In addition, it can be observed from Table I that the savings brought by our architecture become more significant as $\eta$ increases.
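The iteration counts in Table I and the order of operations in the backward-forward scheme can be checked with a small Python sketch. The use of a binary-reflected Gray code to order the test vectors is an assumption: it is one ordering in which adjacent vectors differ in a single unreliable position. The function names and the tuple format of the schedule are hypothetical.

```python
def iteration_counts(n, k, eta):
    """Iteration counts of the three interpolation schemes listed in Table I."""
    common = n - k - eta   # points that are identical in every test vector
    return {
        "forward_only_i": common + (2 ** eta) * eta,
        "forward_only_ii": common + 2 * (2 ** eta - 1),
        "backward_forward": (n - k) + 2 * (2 ** eta - 1),
    }

def backward_forward_schedule(n, k, eta):
    """List the interpolation iterations of the backward-forward scheme."""
    ops = [("forward", "initial", i) for i in range(n - k)]
    prev = 0
    for step in range(1, 2 ** eta):
        code = step ^ (step >> 1)               # Gray code (assumed ordering)
        pos = (prev ^ code).bit_length() - 1    # the single position that changes
        ops += [("backward", "unreliable", pos), ("forward", "unreliable", pos)]
        prev = code
    return ops

counts = iteration_counts(255, 239, 3)
schedule = backward_forward_schedule(255, 239, 3)
```

For the (255, 239) code with $\eta = 3$, the formulas give 37, 27 and 30 iterations for the three schemes, and the schedule contains 16 initial forward iterations followed by 7 backward-forward pairs.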
The backward interpolation has been extended to the case of iterative BGMD decoding with maximum multiplicity $m_{max} = 2$ [10]. In the BGMD decoding, depending on the number of unreliable bits in a symbol, a code position can have either one interpolation point, $(\alpha_j, \beta_j)$, of multiplicity $m_{max}$, two interpolation points, $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$, of multiplicity $m_{max}/2$, or no interpolation point. In addition, multiple decoding iterations using different thresholds for the bit-reliability decision can be carried out to achieve higher coding gain. It has been observed that the BGMD decoding with $m_{max} = 2$ and two decoding iterations can achieve similar or higher coding gain than the KV algorithm with maximum multiplicity four [10]. Using the bit-reliability thresholds in a different sequence does not affect the overall error-correcting performance of iterative BGMD decoding. If a lower threshold is used in the next decoding iteration, then the only case that cannot be handled by forward interpolation is that a code position has two points $(\alpha_j, \beta_j)$, $(\alpha_j, \beta'_j)$ with multiplicity one in the current decoding iteration, but has one point $(\alpha_j, \beta_j)$ with multiplicity two in the next decoding iteration. In this case, $(\alpha_j, \beta'_j)$ can be eliminated similarly from the Gröbner basis by dividing $(x + \alpha_j)$ from a basis polynomial. However, the same factor has also been multiplied during the interpolation over $(\alpha_j, \beta_j)$. There are two polynomials in the Gröbner basis containing this factor. We cannot tell which polynomial was multiplied by this factor during the interpolation over $(\alpha_j, \beta'_j)$ until the quotient is computed and evaluated over $(\alpha_j, \beta'_j)$. In order to reduce the latency in hardware implementation, both $(\alpha_j, \beta_j)$ and $(\alpha_j, \beta'_j)$ are eliminated simultaneously by dividing $(x + \alpha_j)$ from two polynomials in the basis. Then the multiplicity of $(\alpha_j, \beta_j)$ is increased from zero to two by using Nielson's forward interpolation. Employing this scheme, the interpolation result for the next decoding iteration can be derived from that of the current iteration by taking care of only the differing interpolation points. As a result, significant speedup can be achieved compared to starting the interpolation for the next decoding iteration from scratch. Similar ideas can be extended to eliminate all points of multiplicity one from a given interpolation result, if there is no other point with the same $\alpha$ coordinate.
IV. ELIMINATING THE FACTORIZATION
Applying the re-encoding and coordinate transformation, $\bar{r} = \bar{c} + e$ is zero in each of the $k$ reliable code positions. Therefore, $\bar{c}_i = e_i$ for $i \in R$. Assume that the message polynomial corresponding to $\bar{c}$ is $\bar{f}(x)$. Then $\bar{f}(\alpha_i) = \bar{c}_i = e_i$ for $i \in R$. In addition, $e_i = 0$ if there is no error in the $i$th position. In this case, $(x + \alpha_i)$ is a factor of $\bar{f}(x)$. Accordingly,
$$\bar{f}(x) = \Big(\prod_{i\in R, e_i = 0}(x + \alpha_i)\Big)\delta(x),$$
where $\delta(x)$ is a polynomial that does not have any root $\alpha_i$ for $i \in R$ and $e_i \neq 0$. As mentioned previously, the factorization can be applied directly to the interpolation output, $Q(x,y)$, after the re-encoding and coordinate transformation have been applied. Since $\prod_{i\in R}(x + \alpha_i)$ has been divided during the coordinate transformation, the factorization will actually output $\gamma(x) = \bar{f}(x)/\prod_{i\in R}(x + \alpha_i)$, where $y - \gamma(x)$ is a factor of $Q(x,y)$. In addition, it can be derived that
$$\gamma(x) = \frac{\bar{f}(x)}{\prod_{i\in R}(x + \alpha_i)} = \frac{\delta(x)}{\prod_{i\in R, e_i \neq 0}(x + \alpha_i)}. \quad (2)$$
Accordingly, the coefficients of $\gamma(x)$ can be used as syndromes in hard-decision decoding, such as BMA, to recover the errors in the $k$ most reliable code positions. This further explains the decoding process illustrated in Fig. 1.
In the case of LCC decoding, the $y$-degree of $Q(x,y)$ is one. Hence $Q(x,y)$ can be written in the form of $Q(x,y) = q_0(x) + q_1(x)y$. Accordingly, $Q(x,y) = q_1(x)(y + q_0(x)/q_1(x))$. Therefore, $\gamma(x)$ can also be expressed as
$$\gamma(x) = \frac{q_0(x)}{q_1(x)}. \quad (3)$$
Since $\delta(x)$ does not have any root $\alpha_i$ for $i \in R$ and $e_i \neq 0$, by comparing (2) and (3), it can be derived that
$$\begin{cases} q_0(x) = p(x)\delta(x) \\ q_1(x) = p(x)\prod_{i\in R, e_i \neq 0}(x + \alpha_i), \end{cases}$$
where $p(x)$ is the common factor of $q_0(x)$ and $q_1(x)$. It has been proved in [14] that $p(x)$ does not contain any factor $(x + \alpha_i)$ for $i \in R$. Therefore, the error locations in the $k$ most reliable code positions can be found by computing the roots of $q_1(x)$. In addition, from (2) and (3),
$$\bar{f}(x) = \frac{q_0(x)\prod_{i\in R}(x + \alpha_i)}{q_1(x)}.$$
Hence, the error magnitudes can be computed by applying L'Hôpital's rule:
$$e_i = \bar{f}(\alpha_i) = \left.\frac{q_0(x)\big(\prod_{j\in R}(x + \alpha_j)\big)^{(d)}}{(q_1(x))^{(d)}}\right|_{x=\alpha_i}, \quad (4)$$
where $(\cdot)^{(d)}$ denotes the formal derivative of the polynomial.
From the above discussion, the error locations and magni-
tudes for the k most reliable code positions can be computed
directly from the interpolation output. Therefore, the factor-
ization and the key equation solver in the following hard-
decision decoding can be eliminated. In addition, the root
computation for $q_1(x)$ can be implemented by the exhaustive-search-based Chien search, which is also required in hard-decision decoding. It can also be observed that the computation in (4) is similar to that in Forney's algorithm for error magnitude computation in hard-decision decoding.
V. COMPLEXITY ANALYSIS AND COMPARISON
After the factorization step and key equation solver have
been eliminated, the LCC decoding can be carried out accord-
ing to the block diagram in Fig. 4 [14]. In this section, the
hardware complexity of the factorization-free LCC decoder for an RS (255, 239) code constructed over $GF(2^8)$ with $\eta = 3$ is analyzed. In addition, it is compared with that of a hard-decision decoder based on the BMA. To the best of the author's knowledge, no complexity comparison for hard-decision and soft-decision RS decoders has been published.
Hard decisions of the received symbols are usually made in
the receiver front end. In addition, the multiplicity assignment
of the LCC decoding can be done using comparators while
the hard decisions are made. Hence, the hardware required for
multiplicity assignment and hard-decision making is excluded
from the comparison. The hardware requirement and latency
of other blocks in the factorization-free LCC decoder are pro-
vided in Table II. Each block has been pipelined if necessary
to make the critical path no longer than one multiplier, one
adder and one 2-to-1 multiplexor.
The re-encoder in Fig. 4 implements the re-encoding and
coordinate transformation. Re-encoding is basically erasure
decoding. A detailed architecture of the re-encoder based on
the BMA can be found in [30]. The BMA consists of three
steps: syndrome computation, key equation solver, and Chien
search and Forney’s algorithm. In the case of erasure decoding,
the key equation solver can be simplified and the Chien search
can be skipped. Due to the coordinate transformation, the
β coordinates of the interpolation points in code position j
(j ∈ R̄) need to be divided by $\prod_{i \in R} (\alpha^j + \alpha^i)$.
This product can be computed by sharing the hardware for the syndrome
computation in erasure decoding. In addition, the inverter is
implemented by a ROM of 2^8 × 8 bits = 256 bytes. The erasure
decoder at the end of the LCC decoder can be implemented
by the same erasure decoder architecture in the re-encoder.
The latency of both the re-encoder and erasure decoder is 528
clock cycles.
The backward-forward interpolation architecture presented
in Section III is adopted in our LCC decoder since it can
achieve higher efficiency than previous designs. At the
beginning of the interpolation for the (255, 239) code, forward
interpolation is carried out over the 255 − 239 = 16 points with
index in R̄ in the first test vector. Since each point is of
multiplicity one, 16 forward interpolation iterations are required.
After that, one pair of backward and forward interpolation
iterations needs to be applied to derive the interpolation output
for each of the 2^η − 1 = 7 remaining test vectors. The number
of clock cycles required in each interpolation depends on the
maximum x-degree of the polynomials. From simulations, 525
clock cycles are required for the overall interpolation in the
worst case. In addition, at least 28 clock cycles are required
for each pair of backward and forward interpolation.

[Figure 4: block diagram with the blocks Multiplicity assignment, Re-encoder, Interpolation, Chien Search & Forney's algorithm, and Erasure decoder; the input is the channel information.]
Fig. 4. The factorization-free LCC decoder

TABLE II
HARDWARE REQUIREMENT OF THE FACTORIZATION-FREE LCC DECODER

Block                          | GF(2^8) Multiplier | GF(2^8) Adder | Mux (bits) | ROM (bytes) | RAM (bytes)  | Register (bits) | Latency (# of clock cycles)
Re-encoder [30]                | 21                 | 39            | 448        | 512         | 0            | 600             | 528
Interpolation [10]             | 14                 | 12            | 87         | 0           | 68           | 166             | 525
Polynomial Selection           | 8                  | 8             | 139        | 0           | 0            | 264             | 23
Chien Search                   | 8                  | 8             | 0          | 0           | 0            | 128             | 239
Forney's Algorithm             | 2                  | 2             | 136        | 256         | 0            | 24              | 152
Erasure Decoder [30]           | 21                 | 39            | 299        | 256         | 0            | 424             | 528
Factorization-Free LCC Decoder | 74                 | 108           | 1109       | 1024        | 68 + 256 × 8 | 1606            | 528
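The interpolation schedule described above can be sketched in Python as follows. The sketch assumes that the 2^η test vectors are visited in Gray-code order, so that consecutive test vectors differ in exactly one interpolation point and a single backward-forward pair suffices. The ordering, the function names, and the bit-pattern encoding of the test vectors are assumptions made for illustration and are not stated in the text above.

```python
# Sketch of the interpolation schedule for the (255, 239) code with eta = 3.
# Assumption: test vectors are visited in Gray-code order (one point changes per step).
N, K, ETA = 255, 239, 3


def gray_code_order(eta):
    """Bit patterns selecting, per least reliable position, which candidate symbol is used."""
    return [i ^ (i >> 1) for i in range(2 ** eta)]


def interpolation_schedule(n=N, k=K, eta=ETA):
    order = gray_code_order(eta)
    # First test vector: forward interpolation over the n - k points of multiplicity one.
    steps = [("forward", n - k)]
    for previous, current in zip(order, order[1:]):
        # Index of the single position whose candidate symbol is swapped.
        changed = (previous ^ current).bit_length() - 1
        steps.append(("backward+forward", changed))
    return order, steps


order, steps = interpolation_schedule()
```

Under this assumption the schedule consists of 16 forward iterations followed by 7 backward-forward pairs, matching the counts given above.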
The polynomial selection scheme in [6] passes the interpolation
output of only one test vector to the following steps of the
LCC decoding. This selection is not explicitly shown in Figs.
1 and 4. It is based on the number of roots of q1(x) in R̄. The
exhaustive Chien search root computation needs to be finished
in 28 clock cycles in order to match the speed of the
backward-forward interpolation. Since |R̄| = 16 finite field elements
need to be searched and ⌊28/16⌋ = 1, the search over
each element needs to be completed in one clock cycle. The
maximum degree of q1(x) is 8. Hence, the Chien search can be
implemented by 8 multipliers and 8 adders. Decisions need to
be made based on the number of roots after the root search. Taking
this into account, the polynomial selection architecture needs
to be pipelined into 7 stages in order to limit the critical path
to one multiplier, one adder and one multiplexor. Therefore,
the computation on each interpolation output for polynomial
selection takes 16 + 7 = 23 clock cycles, which is still less than
28. Since the polynomial selection engine can finish processing
one interpolation output before the next one is computed, no
extra memory is required to store each interpolation output.
The selected interpolation output is passed to the Chien
search & Forney's algorithm block in Fig. 4 to recover the
errors with index in R. Pipelining can be applied between
the functional blocks in the decoder to increase the speed.
On the other hand, the computation in each pipelining stage
should take about the same time in order to increase the
hardware utilization efficiency. Taking this into account, the
Chien search for computing the roots of q1(x) with index in R
can be completed in k = 239 clock cycles with 8 multipliers
and 8 adders. The denominator of (4) for error magnitude
computation can be derived as an intermediate result of the
Chien search for q1(x). In addition, q0(α^i) can be computed
by a multiplier-adder loop in eight clock cycles if
Horner's rule is applied. Moreover,
$\left( \prod_{j \in R} (x + \alpha^j) \right)^{(d)} \big|_{x = \alpha^i} = 1 / \prod_{j \in \bar{R}} (\alpha^i + \alpha^j)$.
The product $\prod_{j \in \bar{R}} (\alpha^i + \alpha^j)$ can be computed by a
multiplier and an adder in n − k = 16 clock cycles in parallel
with the q0(α^i) computation. After that, another three clock
cycles are required to compute the inversion and products.
Accordingly, 19 clock cycles are needed to compute each
error magnitude. There can be 8 correctable errors for the
(255, 239) code in the worst case. Hence, the error magnitude
computation takes at most 19 × 8 = 152 clock cycles.
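The cycle count for the error magnitude computation can be reproduced with a short Python sketch. The function and parameter names are hypothetical; the numbers are the ones quoted in the paragraph above, and the only modeling assumption is that the q0(α^i) evaluation and the product over R̄ run fully in parallel, so the longer of the two determines the time.

```python
# Cycle budget for the error magnitude computation (names are illustrative).
def forney_cycles(q0_eval_cycles=8, product_cycles=16, finish_cycles=3, max_errors=8):
    # The q0 evaluation and the product over R-bar overlap, so the slower one dominates.
    per_error = max(q0_eval_cycles, product_cycles) + finish_cycles
    return per_error, per_error * max_errors


per_error, total = forney_cycles()
```

With the default arguments this gives 19 cycles per error magnitude and 152 cycles for eight errors, which is well inside the 239 cycles taken by the Chien search over R.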
If pipelining is applied to the LCC decoder according to the
cutsets shown as the dashed lines in Fig. 4, 528 clock cycles
are required to decode a received word. In addition, eight
RAMs of 256 bytes are required for pipelining. Employing
composite field arithmetic, the critical path of a GF(2^8)
multiplier has 6 XOR gates and 1 AND gate. Accordingly,
the critical path of the LCC decoder has 9 gates. On Xilinx
Virtex-II FPGA devices, a clock period of 8 ns can be achieved
for this decoder. Hence, the LCC decoder can easily achieve
a throughput of around 500 Mbps on FPGA devices. In ASIC
implementations, our decoder can achieve a throughput of at
least several gigabits per second.
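The FPGA throughput figure can be checked with the following Python sketch. The text does not say whether the throughput counts all coded symbols or only message symbols, so both are computed; the function name and its parameters are assumptions for illustration.

```python
# Throughput estimate from the pipeline stage length and the clock period.
N, K, BITS_PER_SYMBOL = 255, 239, 8
CYCLES_PER_WORD = 528      # clock cycles per received word with pipelining
CLOCK_PERIOD_NS = 8.0      # clock period quoted for Xilinx Virtex-II devices


def throughput_mbps(symbols, cycles=CYCLES_PER_WORD, period_ns=CLOCK_PERIOD_NS):
    bits = symbols * BITS_PER_SYMBOL
    seconds = cycles * period_ns * 1e-9
    return bits / seconds / 1e6


coded_mbps = throughput_mbps(N)    # counting all 255 code symbols
message_mbps = throughput_mbps(K)  # counting only the 239 message symbols
```

With the 8 ns clock this gives roughly 483 Mbps when all 255 coded symbols are counted and roughly 453 Mbps when only the 239 message symbols are counted, both consistent with the figure of around 500 Mbps.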
Next, the complexity of a hard-decision decoder based on
the BMA is compared to that of the LCC decoder. The
architectures of the BMA are scalable. For the purpose of
comparison, they are scaled to achieve about the same
throughput as the LCC decoder. The architecture for the syndrome
computation can be found in [31]. The syndrome computation
evaluates the polynomial associated with the received word
over the n − k roots of the generator polynomial of the RS
code. One syndrome can be computed using one multiplier-adder
loop in 255 clock cycles for the RS (255, 239) code.
To finish the syndrome computation in about 528 clock cycles,
(n − k)/2 = 8 copies of the multiplier-adder loop are required.
An ultra-folded key equation solver architecture is presented
in [32]. With two multipliers and one adder, this architecture
can compute both the error locator and magnitude polynomials
in 400 clock cycles. The Chien search in the BMA for the RS
(255, 239) code needs to be carried out over 255 finite field
elements for a degree-eight error locator polynomial. It can
be finished by an architecture with four copies of multiplier
and adder in 510 clock cycles. In the BMA, the roots of the
error locator polynomial are actually the inverses of the error
locations. Hence, a ROM of 256 bytes is required to derive
the actual error locations. To reduce the number of pipelining
stages, Forney's algorithm can be implemented in parallel
with the Chien search. Accordingly, the extra latency required
by Forney's algorithm is listed as '0' in Table III. In
this case, four copies of multiplier and adder are required
to calculate the evaluation values of the polynomial in the
numerator. In addition, another multiplier and an inverter are
needed to compute the error magnitudes from the evaluation
values. The hardware requirement of the hard-decision decoder
is summarized in Table III. The critical path of the
hard-decision decoder also consists of one multiplier, one adder
and one multiplexor. Similarly, pipelining cutsets can be added
after the syndrome computation and key equation solver to
achieve higher speed. In this case, the decoding of each
received word takes 510 clock cycles. In addition, three RAMs
of 256 bytes are required to store the hard decisions until the
errors are computed.

TABLE III
HARDWARE REQUIREMENT OF HARD-DECISION BERLEKAMP-MASSEY DECODER

Block                    | GF(2^8) Multiplier | GF(2^8) Adder | Mux (bits) | ROM (bytes) | RAM (bytes) | Register (bits) | Latency (# of clock cycles)
Syndrome computation     | 8                  | 8             | 64         | 0           | 0           | 128             | 510
Key Equation Solver [32] | 2                  | 1             | 21         | 0           | 0           | 413             | 400
Chien Search             | 4                  | 4             | 48         | 0           | 0           | 128             | 510
Forney's Algorithm       | 5                  | 4             | 32         | 256         | 0           | 64              | 0
Hard-decision decoder    | 19                 | 17            | 165        | 256 + 256   | 256 × 3     | 733             | 510
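The scaling of the hard-decision blocks can be illustrated with a small Python sketch. It assumes a simple folding model in which each multiplier-adder copy handles one syndrome, or one term of the locator polynomial, at a time; this model and the function names are assumptions for illustration and are not taken from [31] or [32].

```python
import math

# Folding model for the scaled hard-decision blocks (an illustrative assumption).
N, K = 255, 239


def syndrome_cycles(loops, n=N, k=K):
    """Each multiplier-adder loop produces one syndrome in n clock cycles."""
    return math.ceil((n - k) / loops) * n


def chien_cycles(degree, copies, elements):
    """Each copy handles one polynomial term per clock cycle for the current element."""
    return math.ceil(degree / copies) * elements


hard_syndrome = syndrome_cycles(loops=8)
hard_chien = chien_cycles(degree=8, copies=4, elements=N)
lcc_chien = chien_cycles(degree=8, copies=8, elements=K)
```

Under this model eight loops finish the 16 syndromes in 510 cycles, four multiplier-adder copies finish the Chien search over 255 elements in 510 cycles, and the eight copies in the LCC decoder cover the 239 positions in R in 239 cycles, in agreement with Tables II and III.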
Using composite field arithmetic, each GF(2^8) multiplier
consists of 64 XOR gates and 48 AND gates. Each AND or
OR gate requires 3/4 of the area of an XOR gate, each Mux or
memory cell has the same area as an XOR gate, and each register
occupies about three times the area of an XOR gate. Taking
this into account, the area requirement of the LCC decoder
is around 2.7 times that of the hard-decision decoder. In
addition, the critical path is the same in both decoders, and
the number of clock cycles in each pipelining stage is about
the same. Hence, the two decoders can achieve about the same
throughput. However, since the LCC decoder has one more
pipelining stage, an additional 528 clock cycles elapse
before the first decoded message appears at the output.
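The area ratio can be reproduced from the totals in Tables II and III with the Python sketch below. Areas are expressed in XOR-gate equivalents using the weights given above. One assumption is added that the text does not state: a GF(2^8) adder is taken to be 8 XOR gates (a bitwise XOR of two bytes).

```python
# Area estimate in XOR-gate equivalents, using the weights given in the text.
XOR_AREA = 1.0
AND_AREA = 0.75            # AND/OR gate: 3/4 of an XOR gate
MUX_BIT_AREA = 1.0         # one Mux bit: same as an XOR gate
MEMORY_BIT_AREA = 1.0      # one ROM/RAM cell: same as an XOR gate
REGISTER_BIT_AREA = 3.0    # one register bit: about three XOR gates

MULTIPLIER_AREA = 64 * XOR_AREA + 48 * AND_AREA
ADDER_AREA = 8 * XOR_AREA  # assumption: a GF(2^8) adder is 8 XOR gates


def decoder_area(multipliers, adders, mux_bits, rom_bytes, ram_bytes, register_bits):
    memory_bits = 8 * (rom_bytes + ram_bytes)
    return (multipliers * MULTIPLIER_AREA
            + adders * ADDER_AREA
            + mux_bits * MUX_BIT_AREA
            + memory_bits * MEMORY_BIT_AREA
            + register_bits * REGISTER_BIT_AREA)


# Totals from the last rows of Table II and Table III.
lcc_area = decoder_area(74, 108, 1109, 1024, 68 + 256 * 8, 1606)
hard_area = decoder_area(19, 17, 165, 256 + 256, 256 * 3, 733)
area_ratio = lcc_area / hard_area
```

With these weights the LCC decoder comes to 39311 XOR equivalents and the hard-decision decoder to 14640, a ratio of about 2.7. Memory dominates both totals, so the ratio is sensitive to the per-cell area assumed for ROM and RAM.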
VI. CONCLUSION
This paper presented an efficient high-speed VLSI architecture
for soft-decision LCC decoding. Employing the backward
interpolation and eliminating the factorization step are major
factors contributing to the high speed and efficiency of the
decoder. The proposed decoder can achieve a throughput of
several gigabits per second in ASIC implementations. Com-
pared to a hard-decision decoder with the same throughput,
the soft-decision LCC decoder requires less than three times
the area. As a result, it is feasible to employ soft-decision LCC
decoding in practical applications.
REFERENCES
[1] E. R. Berlekamp, Algebraic Coding Theory, McGraw-Hill, New York,
1968.
[2] R. Koetter and A. Vardy, “Algebraic soft-decision decoding of Reed-
Solomon codes,” IEEE Trans. Inform. Theory, vol. 49, no. 11, pp. 2809-
2825, Nov. 2003.
[3] F. Parvaresh and A. Vardy, “Multiplicity assignments for algebraic soft-
decoding of Reed-Solomon codes,” Proc. Intl. Symp. Info. Theory, pp.
205, Yokohama, Japan, Jul. 2003.
[4] M. El-Khamy and R. J. McEliece, “Interpolation multiplicity assignment
algorithms for algebraic soft-decision decoding of Reed-Solomon codes,”
AMS-DIMACS volume on Algebraic Coding Theory and Info. Theory,
vol. 68, 2005.
[5] N. Ratnakar and R. Koetter, “Exponential error bounds for algebraic
soft-decision decoding of Reed-Solomon codes,” IEEE Trans. on Info. Theory,
vol. 51, pp. 3899-3917, Nov. 2005.
[6] J. Bellorado and A. Kavcic, “A low-complexity method for Chase-type
decoding of Reed-Solomon codes,” Proc. Intl. Symp. Info. Theory, pp.
2037-2041, Seattle, Washington, Jul. 2006.
[7] J. Jiang and K. Narayanan, “Algebraic soft decision decoding of Reed-
Solomon codes using bit-level soft information,” Proc. Allerton Conf.
Commun., Control and Computing, 2006.
[8] M. Sudan, “Decoding of Reed-Solomon codes beyond the error correction
bound,” Journal of Complexity, vol. 12, pp. 180-193, 1997.
[9] V. Guruswami and M. Sudan, “Improved decoding of Reed-Solomon and
algebraic-geometric codes,” IEEE Trans. on Info. Theory, vol. 45, pp.
1755-1764, Sep. 1999.
[10] J. Zhu, X. Zhang and Z. Wang, “Backward interpolation architecture for
algebraic soft-decision Reed-Solomon decoding,” to appear IEEE Trans.
on VLSI Systems.
[11] W. J. Gross, et al., “A VLSI architecture for interpolation in
soft-decision decoding of Reed-Solomon codes,” Proc. IEEE Workshop on
Signal Processing Systems, pp. 39-44, San Diego, Oct. 2002.
[12] R. Koetter and A. Vardy, “A complexity reducing transformation in
algebraic list decoding of Reed-Solomon codes,” Proc. Info. Theory
Workshop, pp. 10-13, Paris, France, Mar. 2003.
[13] J. Zhu, X. Zhang and Z. Wang, “Novel interpolation architecture for
low-complexity Chase soft-decision decoding of Reed-Solomon codes,”
Proc. IEEE Intl. Symp. on Circuits and Systems, pp. 3078-3081, Seattle,
WA, May 2008.
[14] J. Zhu and X. Zhang, “Factorization-free low-complexity Chase
soft-decision decoding of Reed-Solomon codes,” Proc. IEEE Intl. Symp. on
Circuits and Systems, Taiwan, May 2009.
[15] A. Ahmed, R. Koetter and N. Shanbhag, “VLSI architecture for soft-
decision decoding of Reed-Solomon codes,” Proc. IEEE Intl. Conf.
Commun., Paris, France, Jun. 2004.
[16] A. Ahmed, N. Shanbhag and R. Koetter, “Systolic interpolation archi-
tectures for soft-decoding Reed-Solomon codes,” Proc. IEEE Workshop
on Signal Processing Systems, pp. 81-86, Seoul, Korea, Aug. 2003.
[17] X. Zhang, “Reduced complexity interpolation architecture for
soft-decision Reed-Solomon decoding,” IEEE Trans. on VLSI Systems, vol.
14, no. 10, pp. 1156-1161, Oct. 2006.
[18] Z. Wang and J. Ma, “High-speed interpolation architecture for
soft-decision decoding of Reed-Solomon codes,” IEEE Trans. on VLSI
Systems, vol. 14, no. 9, pp. 937-950, Sep. 2006.
[19] X. Zhang and J. Zhu, “Efficient interpolation architecture for
soft-decision Reed-Solomon decoding by applying slow-down,” Proc. IEEE
Workshop on Signal Processing Systems, Washington D.C., Oct. 2008.
[20] J. Zhu and X. Zhang, “Efficient VLSI architecture for soft-decision
decoding of Reed-Solomon codes,” IEEE Trans. on Circuits and Systems-I,
vol. 55, no. 10, pp. 3050-3062, Nov. 2008.
[21] R. Koetter, On Algebraic Decoding of Algebraic-Geometric and Cyclic
Codes, Ph.D. Dissertation, Dept. of Elec. Engr., Linkoping University,
Linkoping, Sweden, 1996.
[22] R. R. Nielson, List Decoder of Linear Block Codes, Ph.D thesis, Dept.
of Mathematics, Technical University of Denmark, 2001.
[23] K. Lee and M. E. O’Sullivan, “An interpolation algorithm using Gröbner
bases for soft-decision decoding of Reed-Solomon codes,” IEEE Intl.
Symp. Info. Theory, Seattle, Washington, Jul. 2006.
[24] R. M. Roth and G. Ruckenstein, “Efficient decoding of Reed-Solomon
codes beyond half the minimum distance,” IEEE Trans. on Info. Theory,
vol. 46, no. 1, pp. 246-257, Jan. 2000.
[25] X. Zhang and K. K. Parhi, “Fast factorization in soft-decision Reed-
Solomon decoding,” IEEE Trans. on VLSI Systems, vol. 13, no. 4, pp.
413-426, Apr. 2005.
[26] X. Zhang, “Partial parallel factorization in soft-decision Reed-Solomon
decoding,” Proc. ACM Great Lakes Symp. VLSI, pp. 272-277, Philadel-
phia, PA, Apr. 2006.
[27] X. Zhang, “Further exploring the strength of prediction in the factor-
ization of soft-decision Reed-Solomon decoding,” IEEE Trans. on VLSI
Systems, vol. 15, no. 7, pp. 811-820, Jul. 2007.
[28] J. Ma, A. Vardy and Z. Wang, “Low latency factorization architecture for
algebraic soft-decision decoding of Reed-Solomon codes,” IEEE Trans.
on VLSI Systems, vol. 15, no. 11, pp. 1225-1238, Nov. 2007.
[29] H. O’Keeffe and P. Fitzpatrick, “Gröbner basis solutions of constrained
interpolation problems,” Linear Algebra and its Applications, vol. 351-352,
pp. 533-551, 2002.
[30] J. Ma, A. Vardy and Z. Wang, “Reencoder design for soft-decision
decoding of an (255,239) Reed-Solomon code,” Proc. IEEE Intl. Symp.
on Circuits and Systems, pp. 3550-3553, Island of Kos, Greece, May
2006.
[31] B. Chen, X. Zhang and Z. Wang, “Error correction for multi-level NAND
flash memory using Reed-Solomon codes,” Proc. IEEE Workshop on
Signal Processing Systems, Washington D.C., Oct. 2008.
[32] K. V. Seth, K.N. Srinivasan and S. Kamakoti, “Ultra folded high-speed
architectures for Reed Solomon decoders,” Proc. of the 19th International
Conference on VLSI Design, Jan. 2006.