Scalable Block-Wise Product BCH Codes by Wu, Yingquan & Gad, Eyal En
Yingquan Wu and Eyal En Gad∗
December 19, 2018
Abstract
In this paper we comprehensively investigate block-wise product (BWP) BCH codes,
wherein raw data is arranged as a block-wise matrix and each row BCH code and each
column BCH code intersect on one data block. We first devise efficient BCH decoding
algorithms, including reduced-1-bit decoding, extra-1-bit list decoding, and extra-2-bit list
decoding. We next present a systematic construction of BWP-BCH codes for given
message and parity lengths that takes into account performance, implementation,
and scalability, rather than focusing on a regularly defined BWP-BCH code. The construction
easily accommodates a different message length or parity length with minimal changes. It
employs extended BCH codes instead of BCH codes to reduce the miscorrection rate, and an
inner RS code to lower the error floor. We also describe a high-speed scalable encoder. Finally,
we present a novel iterative decoding algorithm that is divided into three phases.
The first phase iteratively applies reduced BCH correction capabilities to correct lightly
corrupted rows/columns while suppressing miscorrection, until the process stalls. The
second phase iteratively decodes up to the designed correction capabilities, until the
process stalls. The last phase iteratively applies the proposed list decoding in a novel
manner that effectively determines the correct candidate. The key idea is to use cross
decoding on each list candidate and to pick the candidate that enables the maximum
number of successful cross decodings. Our simulations show that the proposed algorithm
provides a significant performance boost compared to the state-of-the-art algorithms.
∗The work was carried out when both authors worked at Micron Technology, Milpitas, CA 95035, USA.
arXiv:1812.07082v1 [cs.IT] 17 Dec 2018
I. Introduction
Block-wise product (BWP) codes, particularly BWP-BCH codes, have recently attracted considerable
research and practical interest [1]-[4]. In BWP codes, the user data is arranged in a
two-dimensional array composed of rows and columns. Each entry of the array consists
of multiple bits and is called a block, or intersecting block, since it lies at the intersection of a row and a
column. Each row and column is encoded by an error-correcting code, typically a binary
BCH code. In this work we consider only binary BCH codes as constituent codes. BWP-BCH
codes are also called block-wise concatenated BCH (BC-BCH) codes. The decoding
of a BWP-BCH code is performed iteratively: in each iteration the constituent
sensewords in one dimension (say, the rows) are decoded first, and subsequently the words
of the opposite dimension (the columns) are decoded. Expanding the intersection from
a single bit in conventional product codes to a block in BWP-BCH codes allows for stronger
(and fewer) constituent codes for the same overall block length and rate. It turns
out that having stronger constituent codes, even though there are fewer of them, provides a
significant performance boost. A BWP-BCH code with this structure is said to be parallel
concatenated, since the rows and columns can be encoded in parallel. BCH codes can also
be block-wise concatenated in series, by encoding one dimension first (say, the rows) and
then encoding the opposite dimension, including the parity bits of the first dimension. This
is advantageous for classical product codes, but is less effective for BWP codes, in which the
parity-on-parity property no longer holds. An optimization of the construction to improve
the BCH parameters was proposed in [3]. Serial concatenation is investigated in [4]. The
decoding of BWP-BCH codes can be performed in a soft or hard manner. In hard decoding,
the decoder receives a binary channel output for each transmitted bit. The hard decoding
process simply alternates row-wise and column-wise hard decoding iteratively, using
the Berlekamp algorithm. In soft decoding, the decoder receives multi-bit soft information
on the likelihood of the channel value of each transmitted bit; soft decoding
of BCH codes is often referred to as Chase-II decoding [5].
An important challenge in the design of BWP-BCH codes and their decoders is the mitigation
of the error floor. An error floor forms in BWP-BCH codes when the noise level is low
enough for most constituent words (rows and columns) to be corrected, except for a small
number of words whose error count exceeds their correction capability. Three methods have been
proposed to mitigate the error floor of BWP-BCH codes: soft decoding [1], concatenation
of an erasure code over the intersecting blocks [2], and collaborative decoding [3]. In [1], the
decoder obtains additional soft information from the media and uses it to decode the failed
BCH codes. In [2], an erasure code is concatenated over the intersecting blocks, such
that each block is treated as a symbol of the erasure code. Erroneous blocks are identified by
the intersections of failed rows and columns, and are corrected by the outer erasure code; both
Raptor and Reed-Solomon (RS) codes are explored as erasure codes. In [3], a row and a column
that intersect in a single erroneous block are combined to cancel the errors in the intersecting
block (at the cost of adding up the errors in the parities) in an attempt to correct the errors.
BWP-BCH coding is a strong contender for solid-state-drive (SSD) controllers [1, 2, 4]. It
particularly appeals to enterprise SSD storage, wherein latency is the most critical metric.
A normal read from NAND flash outputs single-bit information without soft reliability, while
generating soft reliability takes much longer (typically, multiple retry reads are
carried out and combined externally into multi-bit soft information, or a dedicated NAND command
carries out multiple reads and combines them into multi-bit soft information internally). Thus soft
reads from flash do not appeal to enterprise storage. The data sector length is predominantly 4K bytes
(possibly with extra metadata) and is gradually migrating to 8K bytes. The typical parity
overhead ranges from 8% to 16%. The error floor is required to be below 1e−16. BWP-BCH coding
has also been proposed for the optical transport network, wherein only hard-decision decoding is
considered [6]. For modern optical transport networks, it is generally required that the output
bit-error rate (BER) be below 1e−15 and that the codec achieve a very high data rate,
e.g., 100 Gbps. Thus, turbo codes and low-density parity-check (LDPC) codes are not well suited.
In this work we systematically explore BWP-BCH codes. Our contributions are threefold.
Firstly, we devise several new decoding algorithms for BCH codes. The proposed −1
decoding algorithm effectively eliminates the fruitless Chien search when decoding is declared
unsuccessful. Such −1 decoding is often employed in the decoding of (block-wise) product BCH codes in order
to reduce the BCH miscorrection rate. The refined +1 list decoding algorithm exhibits the same
order of complexity as the state-of-the-art hard decoding algorithms. The proposed +2 list
decoding algorithm exhibits a desirable computational complexity of $O(n^2)$, where n denotes
the code length.
Secondly, we design scalable BWP-BCH codes for arbitrarily given data and parity
lengths, in an attempt to simultaneously achieve three goals: good scalability, low encoding and
decoding complexity, and good waterfall performance with a low error floor. We arrange the
input data in a near-square array so that row and column codes share similar parameters,
which allows the same circuit to be shared for row-wise and column-wise decoding. It also allows
a slight increase or decrease in data or parity length to be accommodated with minor architectural
changes. We choose extended BCH (eBCH) codes instead of BCH codes as constituents
in order to reduce the constituent-wise miscorrection rate while simplifying the decoding process.
We use an inner high-rate Reed-Solomon code to lower the error floor while minimizing the extra
parity overhead and implementation complexity.
Lastly, we investigate efficient hard decoding of the proposed BWP-BCH codes. We aim
to minimize three different error events that cause hard decoding failure. The first is
excessive errors in a few data blocks, which cause both row and column decoding failures.
The second is excessive errors in a few eBCH parities, which are not cross protected.
The last is miscorrection of eBCH constituents, due to their small minimum distance. We
present a novel iterative decoding algorithm that is divided into three phases. The first
phase iteratively applies reduced BCH correction capabilities to correct lightly corrupted
rows/columns while suppressing miscorrection, until the process stalls. The second phase
iteratively decodes up to the designed correction capabilities, until the process stalls. The
last phase iteratively applies the proposed list decoding in a novel manner that effectively
determines the correct candidate, as follows. Once a list of candidates has been
determined for a failed row (column) constituent word, trial correction is performed with each
candidate. For each trial, we check whether Berlekamp decoding of any previously
failed crossing column (row) word now succeeds. We choose the candidate that results in the largest number of crossing
column (row) corrections and make both the row and column corrections accordingly.
The paper is organized as follows. Section II devises BCH decoding algorithms for
reduced-1-bit decoding, extra-1-bit decoding, and extra-2-bit decoding. Section III presents
a systematic BWP-BCH code construction that carefully takes into account performance,
implementation, and scalability. Section IV describes a novel iterative decoding algorithm for
BWP-BCH codes that aims to improve both waterfall and error-floor performance. Section V
describes the evaluation setup and results that validate the proposed decoding algorithm.
Section VI concludes with pertinent remarks.
II. BCH Decoding Algorithms
Let t denote the designed error-correction capability of a (possibly shortened) BCH (n, k)
code, defined over a binary extension field GF($q = 2^m$). Let $\alpha$ denote a primitive element
of GF(q). The underlying generator polynomial g(x) of the BCH code contains the consecutive
roots $\alpha, \alpha^2, \ldots, \alpha^{2t}$, such that
$$g(x) \triangleq \mathrm{LCM}\big(\mu_1(x), \mu_3(x), \ldots, \mu_{2t-1}(x)\big) \tag{1}$$
where $\mu_i(x)$ denotes the minimal binary polynomial of $\alpha^i$ and LCM stands for the least
common multiple (cf. [7]). A counterpart extended BCH (eBCH) code adds an extra root 1,
i.e.,
$$\bar{g}(x) \triangleq (x - 1)\,\mathrm{LCM}\big(\mu_1(x), \mu_3(x), \ldots, \mu_{2t-1}(x)\big). \tag{2}$$
Clearly, each eBCH codeword has even Hamming weight. It is shown in [7, p. 263] that:
Lemma 1 The error-correction capability t of a BCH (n, k) code satisfies
$$n - k = mt \tag{3}$$
and
$$g(x) = \mu_1(x)\,\mu_3(x) \cdots \mu_{2t-1}(x) \tag{4}$$
whenever
$$t \le 2^{\lceil m/2 \rceil - 1}. \tag{5}$$
Note that the BCH codes of practical interest have high code rates; hence, we shall
assume n − k = mt in the subsequent analyses.
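Lemma 1 can be checked numerically for primitive (unshortened) BCH codes by counting the roots of g(x) through cyclotomic cosets. The following is a minimal sketch under that assumption; the function names `bch_parity_length` and `lemma1_bound` are illustrative and do not come from the paper.

```python
from math import ceil

def bch_parity_length(m: int, t: int) -> int:
    """Return deg g(x) = n - k for the primitive binary BCH code of length
    2^m - 1 whose generator has roots alpha^1, ..., alpha^(2t)."""
    n = (1 << m) - 1
    roots = set()
    for i in range(1, 2 * t, 2):      # odd exponents 1, 3, ..., 2t-1
        e = i % n
        while e not in roots:         # walk the cyclotomic coset of i
            roots.add(e)
            e = (2 * e) % n
    return len(roots)

def lemma1_bound(m: int) -> int:
    """Largest t covered by Lemma 1: t <= 2^(ceil(m/2) - 1)."""
    return 1 << (ceil(m / 2) - 1)

for m, t in [(4, 2), (5, 4), (6, 4), (6, 7)]:
    r = bch_parity_length(m, t)
    print(f"m={m} t={t} n-k={r} mt={m * t} within_bound={t <= lemma1_bound(m)}")
```

For m = 6 and t = 7 the sketch gives n − k = 39, i.e. the (63, 24) code, which is smaller than mt = 42 because t exceeds the bound of 4 in (5).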
Define the support of a codeword to be the index set of its nonzero entries. The next lemma
sheds some light on the codeword distribution.
Lemma 2 There does not exist a nonzero codeword whose support lies within an interval of
n − k consecutive positions.
Proof: Assume otherwise. Let c(x) be the polynomial of such a codeword. Then c(x) must
be of the form
$$c(x) = x^i c^*(x)$$
where $c^*(x)$ is nonzero and has degree less than n − k by assumption. By definition, c(x) is divisible by the generator
polynomial g(x). Since $x^i$ is coprime with g(x), g(x) must divide $c^*(x)$. This is
impossible, since $\deg(c^*(x)) < n - k = \deg(g(x))$. This proves the lemma. □
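Lemma 2 can be confirmed by brute force on a small code. The sketch below assumes the (15, 7) BCH code with t = 2 and generator $g(x) = x^8 + x^7 + x^6 + x^4 + 1$ (an illustrative choice, not an example from the paper) and measures, for every nonzero codeword, the shortest window of consecutive positions that covers its support.

```python
G = 0b111010001          # g(x) = x^8 + x^7 + x^6 + x^4 + 1, BCH (15, 7), t = 2
N, K = 15, 7

def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)) product of two polynomials stored as bit masks."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out

def span(c: int) -> int:
    """Length of the shortest window of consecutive positions covering the support."""
    lowest = (c & -c).bit_length() - 1   # index of the lowest nonzero coefficient
    return c.bit_length() - lowest

# Every codeword is msg(x) * g(x) with deg msg < K.
min_span = min(span(clmul(msg, G)) for msg in range(1, 1 << K))
print(min_span, N - K)
```

The smallest window found is n − k + 1 positions (attained by g(x) itself), so no nonzero codeword fits inside n − k consecutive positions.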
Let a (possibly shortened) BCH (n, k) code have designed error-correction capability
t. For a binary senseword $r(x) = \sum_{j=0}^{n-1} r_j x^j$, its syndromes are defined as
$$S_i \triangleq r(\alpha^{i+1}), \quad i = 0, 1, \ldots, 2t - 1. \tag{6}$$
The Berlekamp algorithm is a simplified version of the Berlekamp-Massey algorithm for
decoding binary BCH codes that incorporates the special syndrome property
$$S_{2i+1} = S_i^2, \quad i = 0, 1, 2, \ldots \tag{7}$$
which yields zero discrepancies at even iterations of the Berlekamp-Massey algorithm (cf.
[10]). Below is a concisely reformulated Berlekamp algorithm.
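ALG-1: Reformulated Berlekamp Algorithm
• Input: $S = [S_0, S_1, S_2, \ldots, S_{2t-1}]$
• Initialization: $\Lambda^{(0)}(x) = 1$, $\mathcal{B}^{(-1)}(x) = x$, $L_\Lambda^{(0)} = 0$, $L_{\mathcal{B}}^{(-1)} = 1$
• For r = 0, 2, . . . , 2t − 2, do:
– Compute $\Delta^{(r+2)} = \sum_{i=0}^{L_\Lambda^{(r)}} \Lambda_i^{(r)} \cdot S_{r-i}$
– Compute $\Lambda^{(r+2)}(x) = \Lambda^{(r)}(x) - \Delta^{(r+2)} \cdot \mathcal{B}^{(r-1)}(x)$
– If $\Delta^{(r+2)} \ne 0$ and $2L_\Lambda^{(r)} \le r$, then
∗ Set $\mathcal{B}^{(r+1)}(x) \leftarrow (\Delta^{(r+2)})^{-1} \cdot x^2 \Lambda^{(r)}(x)$
∗ Set $L_\Lambda^{(r+2)} \leftarrow L_{\mathcal{B}}^{(r-1)}$, $L_{\mathcal{B}}^{(r+1)} \leftarrow L_\Lambda^{(r)} + 2$
– Else
∗ Set $\mathcal{B}^{(r+1)}(x) \leftarrow x^2 \mathcal{B}^{(r-1)}(x)$
∗ Set $L_{\mathcal{B}}^{(r+1)} \leftarrow L_{\mathcal{B}}^{(r-1)} + 2$, $L_\Lambda^{(r+2)} \leftarrow L_\Lambda^{(r)}$
• Output: $\Lambda(x)$, $\mathcal{B}(x)$, $L_\Lambda$, $L_{\mathcal{B}}$
In the above algorithm, $\mathcal{B}(x)$ is a shifted version of the polynomial $B(x)$ that is widely used in textbooks (cf.
[10]), namely $\mathcal{B}(x) \triangleq x^2 B(x)$ (we find $\mathcal{B}(x)$ more concise in the subsequent algorithmic descriptions).
It is easily observed from the above algorithm that
$$L_\Lambda + L_{\mathcal{B}} = 2t + 1. \tag{8}$$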
In conventional decoding, the so-called Chien search [10, p. 164] is an exhaustive root
search over $\{\alpha^{-i}\}_{i=0}^{n-1}$ carried out on $\Lambda(x)$. It is worth noting that, for practical high-rate
BCH codes, the Chien search is far more computationally intensive than the Berlekamp
algorithm and the syndrome computation. If the number of distinct roots equals $L_\Lambda$, then the
root indices correspond to the error locations; otherwise a decoding failure is declared.
We use “−1 decoding” to refer to correcting up to t − 1 errors under the designed correction
capability t. In [15], it is shown that performing reduced-1-bit decoding during the first
iteration achieves superior decoding performance by reducing the constituent-wise
miscorrection rate. In Section IV, we also incorporate reduced-1-bit decoding into our
proposed iterative decoding of BWP-BCH codes. In the following we present an efficient
reduced-1-bit decoding algorithm.
ALG-2: −1 Decoding Algorithm
• Input: $S_0, S_1, \ldots, S_{2t-1}$
• Apply the Berlekamp algorithm to produce $\Lambda(x)$ and $L_\Lambda$.
• If $L_\Lambda \ge t$, then declare failure.
• Perform the Chien search to determine all roots. If the number of distinct roots equals
$L_\Lambda$, then correct all erroneous bits; else declare failure.
Note that the extra syndrome $S_{2t-1}$ is used for (t − 1)-error correction. Its advantage is twofold. When
the senseword is (t − 1)-correctable, it guarantees $L_\Lambda < t$; otherwise, $L_\Lambda < t$ occurs with probability $q^{-1}$, where q
denotes the size of the operation field [10, 11], thus precluding the fruitless Chien search. Clearly, when
there are t errors, the Berlekamp algorithm results in $L_\Lambda = t$ and thus the Chien search is
precluded.
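The reformulated Berlekamp algorithm (ALG-1) and the −1 decoding rule (ALG-2) can be sketched together as follows. This is a toy illustration, not the authors' implementation: it assumes GF($2^4$) built from $x^4 + x + 1$, code length n = 15, and an all-zero transmitted codeword, so that the senseword equals the error pattern. All function names are hypothetical.

```python
M, PRIM = 4, 0b10011                 # GF(2^4) via x^4 + x + 1 (illustrative choice)
Q = 1 << M
N = Q - 1
EXP, LOG = [0] * (2 * N), [0] * Q
v = 1
for i in range(N):                   # antilog/log tables; EXP is doubled to skip a modulo
    EXP[i] = EXP[i + N] = v
    LOG[v] = i
    v <<= 1
    if v & Q:
        v ^= PRIM

def mul(a, b):
    return EXP[LOG[a] + LOG[b]] if a and b else 0

def inv(a):
    return EXP[N - LOG[a]]

def poly_eval(p, x):                 # Horner evaluation, p[i] is the coefficient of x^i
    acc = 0
    for c in reversed(p):
        acc = mul(acc, x) ^ c
    return acc

def poly_add(p, q):
    out = [0] * max(len(p), len(q))
    for i, c in enumerate(p):
        out[i] ^= c
    for i, c in enumerate(q):
        out[i] ^= c
    return out

def syndromes(err_pos, t):
    """S_i = r(alpha^(i+1)), i = 0..2t-1, for the zero codeword plus errors at err_pos."""
    S = []
    for i in range(2 * t):
        s = 0
        for j in err_pos:
            s ^= EXP[(j * (i + 1)) % N]
        S.append(s)
    return S

def berlekamp(S, t):
    """ALG-1: returns (Lambda, calB, L_Lambda, L_calB)."""
    lam, calb = [1], [0, 1]          # Lambda^(0) = 1, calB^(-1) = x
    L_lam, L_b = 0, 1
    for r in range(0, 2 * t, 2):     # only even iterations are needed
        delta = 0
        for i, c in enumerate(lam):
            if i <= r:
                delta ^= mul(c, S[r - i])
        new_lam = poly_add(lam, [mul(delta, c) for c in calb])
        if delta != 0 and 2 * L_lam <= r:
            d_inv = inv(delta)
            calb = [0, 0] + [mul(d_inv, c) for c in lam]   # x^2 * Lambda / delta
            L_lam, L_b = L_b, L_lam + 2
        else:
            calb = [0, 0] + calb                           # x^2 * calB
            L_b += 2
        lam = new_lam
    return lam, calb, L_lam, L_b

def chien(lam, n=N):
    """Indices i in [0, n) with Lambda(alpha^-i) = 0."""
    return [i for i in range(n) if poly_eval(lam, EXP[(N - i) % N]) == 0]

def minus1_decode(S, t):
    """ALG-2: error positions for up to t-1 errors, or None on failure."""
    lam, _, L_lam, _ = berlekamp(S, t)
    if L_lam >= t:
        return None                  # failure declared without any Chien search
    roots = chien(lam)
    return roots if len(roots) == L_lam else None

print(minus1_decode(syndromes([2, 9], 3), 3))      # two errors, t = 3
print(minus1_decode(syndromes([0, 1, 4], 3), 3))   # three errors, t = 3
```

With t = 3, the two-error pattern is located, while the three-error pattern yields $L_\Lambda = t$ and is rejected before the Chien search, which is the saving described above.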
We next introduce list decoding algorithms that correct one extra bit beyond t. Define
$$Q(x) \triangleq \frac{\Lambda(x)}{\mathcal{B}(x)} \tag{9}$$
and
$$Q_i = \{j : Q(\alpha^{-j}) = i\}. \tag{10}$$
An efficient extra-1-bit list decoding algorithm is given below, with minor modifications
from [11].
ALG-3: +1 List Decoding Algorithm
Input: $\Lambda(x)$, $\mathcal{B}(x)$, $L_\Lambda$
1. If $L_\Lambda > t + 1$, then declare a decoding failure.
2. If $L_\Lambda \le t$, then determine all distinct roots in $\{\alpha^{-i}\}_{i=0}^{n-1}$. If the number of (distinct) roots
is equal to $L_\Lambda$, then return the corresponding unique codeword; otherwise, if $L_\Lambda < t$,
declare a decoding failure (which is identical to the normal Berlekamp algorithm).
3. Initialize $\delta_i = \emptyset$, i = 0, 1, 2, . . . , q − 1.
4. For i = 0, 1, 2, . . . , n − 1, do:
• Evaluate $Q_i = \Lambda(\alpha^{-i}) / \mathcal{B}(\alpha^{-i})$.
• If $Q_i \ne \infty$, then set $\delta_{Q_i} \leftarrow \delta_{Q_i} \cup \{i\}$.
• If $|\delta_{Q_i}| = t + 1$, then flip the bits at the indices in $\delta_{Q_i}$ and output the resulting candidate
codeword.
Note that a minor correction is made in Step 2 relative to the original algorithm in [11]. Specifically,
the clause “if $L_\Lambda < t$” is necessary for early termination, since $L_\Lambda = t$ may still yield valid
(t + 1)-error corrections. Also note that $\mathcal{B}(\alpha^{-i}) = 0$ results in $Q_i = \infty$; since $\mathcal{B}(x)$ is not a
valid error locator polynomial, its roots are safely ignored. We observe that the sets $\{Q_i\}_{i=0}^{q-1}$ can
be implemented efficiently with a linked-list structure at a space complexity of O(q). Overall,
the computational complexity of the above algorithm remains the same as that of the Berlekamp
algorithm, i.e., O(tn), but with a larger space complexity of O(q).
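The reason all error positions share one value of Q can be sketched as follows. The sketch assumes the senseword contains exactly t + 1 errors, writes $\mathcal{B}(x)$ for the shifted polynomial output by ALG-1, and introduces two symbols for illustration only: $\Delta^{*}$, the discrepancy that one further Berlekamp iteration would produce if the missing syndrome $S_{2t}$ were known, and $\Lambda^{*}(x)$, the resulting true error locator.

```latex
\begin{aligned}
\Lambda^{*}(x) &= \Lambda(x) - \Delta^{*}\,\mathcal{B}(x)
  && \text{(one more iteration of ALG-1)} \\
\Lambda^{*}(\alpha^{-j}) = 0
  &\;\Longrightarrow\; \Lambda(\alpha^{-j}) = \Delta^{*}\,\mathcal{B}(\alpha^{-j})
  && \text{(for each error position } j\text{)} \\
  &\;\Longrightarrow\; Q(\alpha^{-j}) = \frac{\Lambda(\alpha^{-j})}{\mathcal{B}(\alpha^{-j})} = \Delta^{*}
  && \text{(whenever } \mathcal{B}(\alpha^{-j}) \neq 0\text{)}
\end{aligned}
```

Under these assumptions the t + 1 error positions all fall into the same bucket $\delta_{\Delta^{*}}$, which is the event Step 4 of ALG-3 detects without ever computing $S_{2t}$.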
Note that the n terms $\{Q_i\}_{i=0}^{n-1}$ can form at most $\lfloor \frac{n}{t+1} \rfloor$ groups of t + 1 identical
values. Therefore, the above one-step-ahead algorithm may produce up to $\lfloor \frac{n}{t+1} \rfloor$ candidate
codewords. An alternative interpretation is to flip each of the n bits in turn and apply the
Berlekamp algorithm each time. This produces at most n codewords (assuming each decoding trial is
successful). Each codeword is repeated t + 1 times, since the same codeword is obtained
by flipping any of its t + 1 error bits. However, if the actual minimum distance is at least 2t + 3,
particularly for shortened codes, then there is at most a single candidate.
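For some (particularly low-rate) BCH codes, it happens that $S_{2t+2}, \ldots, S_{2t+2\tau}$ ($\tau \ge 1$) are
known. By sweeping $S_{2t}$ over GF(q), up to $t + 1 + \tau$ errors can be list decoded with a computational
complexity of O(qnt). The algorithm is detailed below.
ALG-4: $+(\tau + 1)$ List Decoding with Known $\{S_{2t+2i}\}_{i=1}^{\tau}$
Input: $\Lambda^{(2t)}(x)$, $\mathcal{B}^{(2t-1)}(x)$, $L_\Lambda^{(2t)}$, $L_{\mathcal{B}}^{(2t-1)}$, $\{S_{2t+2i}\}_{i=1}^{\tau}$
• For $S_{2t} = 0, 1, 2, \ldots, q - 1$, do:
– For $r = 2t, 2t + 2, \ldots, 2t + 2\tau$, do:
∗ Compute $\Delta^{(r+2)} = \sum_{i=0}^{L_\Lambda^{(r)}} \Lambda_i^{(r)} \cdot S_{r-i}$
∗ Compute $\Lambda^{(r+2)}(x) = \Lambda^{(r)}(x) - \Delta^{(r+2)} \cdot \mathcal{B}^{(r-1)}(x)$
∗ If $\Delta^{(r+2)} \ne 0$ and $2L_\Lambda^{(r)} \le r$, then
· Set $\mathcal{B}^{(r+1)}(x) \leftarrow (\Delta^{(r+2)})^{-1} \cdot x^2 \Lambda^{(r)}(x)$
· Set $L_\Lambda^{(r+2)} \leftarrow L_{\mathcal{B}}^{(r-1)}$, $L_{\mathcal{B}}^{(r+1)} \leftarrow L_\Lambda^{(r)} + 2$
∗ Else
· Set $\mathcal{B}^{(r+1)}(x) \leftarrow x^2 \mathcal{B}^{(r-1)}(x)$
· Set $L_{\mathcal{B}}^{(r+1)} \leftarrow L_{\mathcal{B}}^{(r-1)} + 2$, $L_\Lambda^{(r+2)} \leftarrow L_\Lambda^{(r)}$
– Perform the Chien search on $\Lambda(x)$. If the number of distinct roots equals $L_\Lambda$,
then output the resulting candidate codeword.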
Consider the (63, 24) BCH code with t = 7. Its generator polynomial contains the roots
$\alpha, \alpha^3, \ldots, \alpha^{13}$, but also two extra roots $\alpha^{17}$ and $\alpha^{19}$. For this code, up to 3 extra errors,
i.e., up to 10 errors, can be list decoded by sweeping $S_{14}$. Note that the above algorithm may
further incorporate the preceding extra-1-bit decoding to achieve extra-($\tau$+2)-bit list decoding
at the same order of complexity.
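The extra roots in the (63, 24) example can be verified from the cyclotomic cosets modulo 63, obtained by repeated doubling of the exponent. The coset notation $C_i$ below is introduced for illustration and is not used in the paper.

```latex
\begin{aligned}
C_5    &= \{5, 10, 20, 40, 17, 34\}, \\
C_{13} &= \{13, 26, 52, 41, 19, 38\}, \\
C_{15} &= \{15, 30, 60, 57, 51, 39\}.
\end{aligned}
```

Since $17 \in C_5$ and $19 \in C_{13}$, both $\alpha^{17}$ and $\alpha^{19}$ are roots of g(x), so $S_{16} = r(\alpha^{17})$ and $S_{18} = r(\alpha^{19})$ are available. In contrast, $C_{15}$ is disjoint from the cosets of 1, 3, ..., 13, so $S_{14} = r(\alpha^{15})$ is unknown and has to be swept. This corresponds to $\tau = 2$ and a list decoding radius of $t + 1 + \tau = 10$.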
We proceed to present an efficient extra-2-bit list decoding algorithm. The basic idea is
to apply a one-pass Chase decoding [12] followed by the above extra-1-bit decoding.
A one-pass Chase decoding algorithm is described in [12], in which the error locator polynomial
associated with flipping a bit is obtained in constant time and with a computational
complexity of O(t). The following describes one-pass one-bit flipping Chase decoding.
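ALG-5: One-Pass One-Bit Flipping Chase Decoding Algorithm
Input: $\Lambda(x)$, $\mathcal{B}(x)$, $L_\Lambda$, $L_{\mathcal{B}}$
• For i = 0, 1, 2, . . . , n − 1, do:
1. Evaluate $\Lambda_i \leftarrow \Lambda(\alpha^{-i})$, $\mathcal{B}_i \leftarrow \mathcal{B}(\alpha^{-i})$
2. Update polynomials:
– Case 1: $\Lambda_i = 0 \;\vee\; (\Lambda_i \ne 0 \wedge \mathcal{B}_i \ne 0 \wedge L_\Lambda \ge L_{\mathcal{B}})$
$\Lambda^{(i)}(x) \leftarrow \mathcal{B}_i \cdot \Lambda(x) - \Lambda_i \cdot \mathcal{B}(x)$
$\mathcal{B}^{(i)}(x) \leftarrow (x^2 - \alpha^{-2i})\,\mathcal{B}(x)$
$L_\Lambda^{(i)} \leftarrow L_\Lambda$, $L_{\mathcal{B}}^{(i)} \leftarrow L_{\mathcal{B}} + 2$
– Case 2: $\mathcal{B}_i = 0 \;\vee\; (\Lambda_i \ne 0 \wedge \mathcal{B}_i \ne 0 \wedge L_\Lambda < L_{\mathcal{B}} - 1)$
$\Lambda^{(i)}(x) \leftarrow (x^2 - \alpha^{-2i})\,\Lambda(x)$
$\mathcal{B}^{(i)}(x) \leftarrow \mathcal{B}_i \cdot x^2 \Lambda(x) - \alpha^{-2i} \Lambda_i \cdot \mathcal{B}(x)$
$L_\Lambda^{(i)} \leftarrow L_\Lambda + 2$, $L_{\mathcal{B}}^{(i)} \leftarrow L_{\mathcal{B}}$
– Case 3: $\Lambda_i \ne 0 \wedge \mathcal{B}_i \ne 0 \wedge L_\Lambda = L_{\mathcal{B}} - 1$
$\Lambda^{(i)}(x) \leftarrow \mathcal{B}_i \cdot \Lambda(x) - \Lambda_i \cdot \mathcal{B}(x)$
$\mathcal{B}^{(i)}(x) \leftarrow \mathcal{B}_i \cdot x^2 \Lambda(x) - \alpha^{-2i} \Lambda_i \cdot \mathcal{B}(x)$
$L_\Lambda^{(i)} \leftarrow L_\Lambda + 1$, $L_{\mathcal{B}}^{(i)} \leftarrow L_{\mathcal{B}} + 1$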
For each pair $\big(\Lambda^{(i)}(x), \mathcal{B}^{(i)}(x)\big)$, i = 0, 1, . . . , n − 1, we may apply the proposed +1 decoding
algorithm to determine all candidates within a difference of t + 1 bits (note that index i has already been
flipped). Thus, combining the above one-pass one-bit flipping Chase decoding algorithm
with the +1 decoding algorithm effectively list decodes all codewords up to distance t + 2.
Clearly, the overall computational complexity is $O(n^2 t)$, due to n deployments of extra-1-bit
decoding. We now explore ways to reduce the complexity. We first note that $\{(\Lambda_i, \mathcal{B}_i)\}_{i=0}^{n-1}$ are
evaluated by the above algorithm, so, instead of updating the polynomial pairs $\big(\Lambda^{(i)}(x), \mathcal{B}^{(i)}(x)\big)$,
i = 0, 1, . . . , n − 1, it takes O(n) to evaluate each vector pair
$\big(\{\Lambda_j^{(i)}\}_{j=0}^{n-1}, \{\mathcal{B}_j^{(i)}\}_{j=0}^{n-1}\big)$ for an
index i. Consequently, each run of the +1 decoding algorithm takes merely O(n) complexity. We further
note that, when the i-th index is flipped for +1 decoding, all candidates of (t + 2)-bit correction
involving at least one bit among the indices {0, 1, 2, . . . , i − 1} have already been listed; therefore, the
extra-1-bit decoding algorithm associated with flipping the i-th bit only needs to search through the
index subset {i + 1, i + 2, . . . , n − 1}. The detailed algorithmic procedure is described below.
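ALG-6: +2 List Decoding Algorithm
Input: $\Lambda(x)$, $\mathcal{B}(x)$, $L_\Lambda$, $L_{\mathcal{B}}$
1. Evaluate and store $\{\Lambda_i\}_{i=0}^{n-1} \leftarrow \{\Lambda(\alpha^{-i})\}_{i=0}^{n-1}$, $\{\mathcal{B}_i\}_{i=0}^{n-1} \leftarrow \{\mathcal{B}(\alpha^{-i})\}_{i=0}^{n-1}$.
2. For i = 0, 1, 2, . . . , n − t − 2, do:
(a) Initialize $\delta_j = \emptyset$, j = 0, 1, 2, . . . , q − 1.
(b) For j = i + 1, i + 2, . . . , n − 1, do:
• Compute
$$\bar{Q}_j \leftarrow \begin{cases}
\dfrac{\mathcal{B}_i \Lambda_j + \Lambda_i \mathcal{B}_j}{(\alpha^{-2(j-i)} + 1)\,\mathcal{B}_j}, & \text{if } \Lambda_i = 0 \vee (\Lambda_i \ne 0 \wedge \mathcal{B}_i \ne 0 \wedge L_\Lambda \ge L_{\mathcal{B}}) \\[2ex]
\dfrac{(\alpha^{-2j} + \alpha^{-2i})\,\Lambda_j}{\alpha^{-2(j-i)} \mathcal{B}_i \Lambda_j + \Lambda_i \mathcal{B}_j}, & \text{if } \mathcal{B}_i = 0 \vee (\Lambda_i \ne 0 \wedge \mathcal{B}_i \ne 0 \wedge L_\Lambda < L_{\mathcal{B}} - 1) \\[2ex]
\dfrac{\mathcal{B}_i \Lambda_j + \Lambda_i \mathcal{B}_j}{\alpha^{-2(j-i)} \mathcal{B}_i \Lambda_j + \Lambda_i \mathcal{B}_j}, & \text{otherwise}
\end{cases}$$
• If $\bar{Q}_j \ne \infty$, then set $\delta_{\bar{Q}_j} \leftarrow \delta_{\bar{Q}_j} \cup \{j\}$.
• If $|\delta_{\bar{Q}_j}| = t + 1$, then flip the bits at the indices in $\delta_{\bar{Q}_j} \cup \{i\}$ and output the resulting
candidate codeword.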
Note that $\bar{Q}_j$ in Step 2(b) above is scaled by the constant $\alpha^{-2i}$, which does not alter the
result. Clearly, the above algorithm exhibits a computational complexity of $O(n^2)$ and a space
complexity of O(q). Assume, in an ideal scenario, that flipping each pair of bits results in a
candidate with t errors. There are then up to $\binom{n}{2}$ candidate codewords. Each candidate is
counted exactly $\binom{t+2}{2}$ times, because flipping any 2 of the t + 2 errors
leads to the same codeword. Therefore, the number of candidate codewords is bounded
by $\frac{n(n-1)}{(t+2)(t+1)}$. However, if the actual minimum distance is at least 2t + 3, particularly in the
case of shortened codes, then there exist at most $\lfloor \frac{n}{t+2} \rfloor$ candidates. Moreover, if the actual
minimum distance is at least 2t + 5, particularly in the case of highly shortened codes, then
there is at most a single candidate.
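The first branch of $\bar{Q}_j$ in ALG-6 can be derived from Case 1 of ALG-5 by evaluating the updated pair at $x = \alpha^{-j}$. This is a sketch of that one case, using the fact that subtraction equals addition in characteristic 2 and writing $\mathcal{B}$ for the shifted polynomial output by ALG-1; the other two branches follow in the same way.

```latex
\begin{aligned}
\Lambda^{(i)}(\alpha^{-j}) &= \mathcal{B}_i\Lambda_j + \Lambda_i\mathcal{B}_j, \\
\mathcal{B}^{(i)}(\alpha^{-j}) &= (\alpha^{-2j} + \alpha^{-2i})\,\mathcal{B}_j
   = \alpha^{-2i}\,(\alpha^{-2(j-i)} + 1)\,\mathcal{B}_j, \\
\bar{Q}_j = \alpha^{-2i}\,\frac{\Lambda^{(i)}(\alpha^{-j})}{\mathcal{B}^{(i)}(\alpha^{-j})}
  &= \frac{\mathcal{B}_i\Lambda_j + \Lambda_i\mathcal{B}_j}{(\alpha^{-2(j-i)} + 1)\,\mathcal{B}_j}.
\end{aligned}
```

Because the factor $\alpha^{-2i}$ is the same for every j within one pass of the inner loop, it does not change which indices share a common value, so the grouping into the sets $\delta$ is unaffected.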
An alternative but less efficient approach to extra-2-bit list decoding is to sweep
all possible values of $S_{2t}$, or equivalently all possible values of $\Delta^{(2t+2)}$, and then deploy extra-1-bit
list decoding on each of the q resulting pairs $\big(\Lambda(x), \mathcal{B}(x)\big)$. Its complexity, after appropriate
optimization, is reduced to O(qn). This approach is akin to the t + 1 list decoding algorithm for
Reed-Solomon codes [13]. The proposed +2 decoding algorithm is clearly advantageous when
$n \ll q$, which often holds true during iterative decoding of BWP-BCH and other product-like
BCH codes.
The algorithms presented in this section assume consecutive error locators, $\{\alpha^i\}_{i=0}^{n-1}$, as
defined for conventionally shortened codes. In the scenario of BWP-BCH decoding, due to
partial correction by cross decoding, the error locators are usually not consecutive. Let
$\{\alpha_i\}_{i=0}^{n^*-1}$ denote the set of uncorrected error locators (where $n^* \le n$). In this case, $\alpha^i$ (likewise
$\alpha^j$) in the above algorithms is replaced by $\alpha_i$, such that $\alpha^{-i} \to \alpha_i^{-1}$ and $\alpha^{-2i} \to \alpha_i^{-2}$.
III. Designing Scalable Block-Wise Product BCH Codes
Instead of following the conventional approach of studying a regularly defined BWP-BCH code,
we start from scratch with an arbitrarily given message length K and parity length R.
We leverage the following degrees of freedom to design a “good” BWP-BCH code, wherein “good”
qualitatively means good waterfall performance and a low error floor:
(i). block size b;
(ii). BCH vs. eBCH;
(iii). serial vs. parallel concatenation;
(iv). square vs. rectangular shape in organizing message blocks;
(v). concatenation of an outer or inner erasure code.
Figure 1: An example illustration of the case-1 BWP-BCH codeword.
[Figure: array of data blocks $D_{i,j}$ ending with a zero-padded block and an RS parity block RS0, with row and column eBCH parities of correction capability t or t+1.]
Specifically, we leverage these degrees of freedom under the following guidelines.
(i). The block size b affects the BCH message length, the operation field, and the error-correction
capability. b must be relatively large so that the resulting BCH codes exhibit a low miscorrection
rate. It is preferred to have t ≥ 4. It is also preferred to choose b as a multiple of
8 to facilitate hardware implementation. However, it is unnecessary to force b to divide K,
as opposed to the literature [1, 3]. This is because a partial last message block (when b
does not divide K) can be handled at no rate penalty. Specifically, the partial block is
padded with $\lceil \frac{K}{b} \rceil \cdot b - K$ zeros to form a full block for both inner and outer coding, but the
padded zeros are not actually transmitted, and an equal number of zeros are re-padded at
the receiver. Though it is possible to harness the error-free property of the padded zeros to
rule out certain miscorrections, we suggest treating the padded block as a regular block, so
that all row and column BCH constituent words are treated uniformly.
(ii). An eBCH code includes an extra parity bit to enforce even codeword weights, and thus
halves the miscorrection rate. Using a parity bit also reduces decoding complexity.
For a given eBCH senseword, either +1 list decoding or +2 list decoding is applicable, but
not both. This is because a parity syndrome of 0 indicates that the number of errors
is even, whereas 1 indicates an odd number of errors. In [16], it is proven that the
number of Chase-II decoding trials of eBCH codes can be cut by half without performance
loss, rendering soft-decision decoding of product eBCH codes more efficient. Our extensive
simulations also indicate that BWP-eBCH codes perform slightly better than their counterpart
BWP-BCH codes, which is attributed to the lower miscorrection rate of eBCH codes. Therefore,
we shall use eBCH codes, instead of BCH codes, as constituent codes.
Figure 2: An example illustration of the case-2 BWP-BCH codeword.
[Figure: array of data blocks $D_{i,j}$ with one more column than rows, ending with a zero-padded block and an RS parity block RS0, with row and column eBCH parities of correction capability t or t+1.]
(iii). In parallel concatenation, each message block is protected by a row and a column eBCH
code, but neither the row nor the column eBCH parity is protected by the other, whereas in serial
concatenation, the parities of one dimension are covered by the parities of the other
dimension. It is known that BWP codes lose the key parity-on-parity property of
conventional product codes. Parity-on-parity in conventional product codes yields a minimum
distance equal to the product of the row and column minimum distances, but this does not hold for BWP
codes even when one dimension of parity is protected, i.e., under serial concatenation. In short,
serial concatenation does not exhibit a conspicuous property such as a product of minimum
distances. Therefore, we choose parallel concatenation, wherein row and column decoding
are symmetrical. The symmetry allows the same circuit to be used for row and
column decoding. Moreover, parallel concatenation inherently allows row and column
encoding in parallel.
Lemma 2 sheds some light on minimum-weight codewords (associated with the minimum
distance). There is no codeword whose support lies only in eBCH parities. Furthermore,
if the block size is no greater than the eBCH parity length (which is often the case), then
there does not exist a codeword whose support lies only in one message block. An inner
f-erasure RS code guarantees at least f + 1 nonzero blocks, yielding at least f + 1 uncorrelated
nonzero eBCH codewords.
(iv). Message blocks are organized in a near square shape so that row and column decod-
ing operations are nearly identical. To enforce code scalability as well as implementation
simplicity, all eBCH codes are defined in the same field (even if the last data column has a
single block).
(v). Concatenation of an erasure code mitigates the error floor at the price of lower BCH correction
capabilities. An inner Reed-Solomon (RS) code is incorporated to mitigate the error floor.
This differs from [2], wherein an outer erasure code is concatenated. An outer code has
to protect the message blocks and the eBCH parities, along with its own parity; therefore, it
demands more parity than an inner code, which protects only the message blocks. The rationale
behind not protecting the eBCH parities is that, when rows/columns have few errors in their
data messages but excessive errors in their parities that result in decoding failure, the errors in
the data messages can predominantly be corrected by cross decoding, and the errors in the parities
can then simply be recovered through re-encoding. In fact, our extensive simulations indicate
that up to 4 RS parity symbols suffice to reduce the error floor satisfactorily for our codes of
interest. Therefore, adopting an inner RS code not only reserves more redundancy for eBCH parity
but also reduces implementation complexity. Note that two erasure blocks make it possible to
recover single-row plus two-column failures or single-column plus two-row failures. Likewise,
three erasure blocks make it possible to recover single-row plus three-column failures or single-column
plus three-row failures. Four erasure blocks make it possible to recover single-row plus four-column
failures, single-column plus four-row failures, or two-row plus two-column failures. When an inner code
is not used, i.e., f = 0, the BWP-BCH code suffers from the dominant failure mechanism of
a single uncorrectable row and column [3]. The authors of [3] proposed a collaborative method
that forges a new BCH senseword from the failed row and column constituents. However, this
approach succeeds only in some cases, while requiring extra effort to forge new syndromes.
On the other hand, the single row and column failure is readily handled using a single-parity
inner code, at the cost of one parity block of overhead.
(vi). The construction easily accommodates multiple message lengths and parity lengths at
fine granularity. This is particularly important in data storage, wherein different
vendors/customers have slightly different requirements.
Given the block size b, the number of message blocks is $\lceil \frac{K}{b} \rceil$. Assume also that f-erasure RS
encoding is deployed. The inner RS code length is given by $\eta \triangleq \lceil \frac{K}{b} \rceil + f$. Evidently, the
minimum field dimension for RS coding is $\lceil \log_2 \eta \rceil$. Let p satisfy
$$p(p - 1) < \eta \le p(p + 1). \tag{11}$$
Note that the set of positive integers, $\mathbb{Z}^+$, is partitioned into the disjoint intervals
$$\mathbb{Z}^+ = \bigcup_{p=1}^{\infty} \big(p(p-1),\, p(p+1)\big].$$
Thus, any positive integer belongs to exactly one of the intervals $\big(p(p-1),\, p(p+1)\big]$.
To solve for p, first let a be the positive real root of the equation
$$a(a + 1) = \eta.$$
We obtain
$$a = \frac{-1 + \sqrt{1 + 4\eta}}{2}.$$
It is easily verified that $p = \lceil a \rceil$, i.e.,
$$p = \left\lceil \frac{-1 + \sqrt{1 + 4\eta}}{2} \right\rceil \tag{12}$$
is the unique solution of (11).
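As a quick check of (12), consider two small values of $\eta$ chosen purely for illustration, one on each side of an interval boundary of (11).

```latex
\begin{aligned}
\eta = 20:&\quad a = \frac{-1 + \sqrt{81}}{2} = 4, && p = \lceil 4 \rceil = 4, && 4 \cdot 3 = 12 < 20 \le 20 = 4 \cdot 5, \\
\eta = 21:&\quad a = \frac{-1 + \sqrt{85}}{2} \approx 4.11, && p = \lceil 4.11 \rceil = 5, && 5 \cdot 4 = 20 < 21 \le 30 = 5 \cdot 6.
\end{aligned}
```

The value $\eta = 20$ sits exactly at the upper end of the interval for p = 4, and adding one more block moves the layout to p = 5.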
We further distinguish two cases. The first case is such that
$$p(p - 1) < \eta \le p^2. \tag{13}$$
Then the inner RS code is arranged into a p × p matrix. There are 2p eBCH words, each
allocated $\lceil \frac{R - fb}{2p} \rceil$ parity bits on average. Accordingly, the eBCH code field dimension is
determined by
$$m = \left\lceil \log_2 \left( pb + \left\lceil \frac{R - fb}{2p} \right\rceil \right) \right\rceil, \tag{14}$$
and the base correction capability is given by
$$t = \left\lfloor \frac{R - fb - 2p}{2pm} \right\rfloor, \tag{15}$$
wherein we assume the code rate is high enough to meet the condition of Lemma 1. For
lower-rate codes where Lemma 1 does not apply, t is determined through computer search.
Note that there remains extra correction power of
$$\theta = \left\lfloor \frac{R - fb - 2p}{m} \right\rfloor - 2pt. \tag{16}$$
When θ > 0, the θ longest inner block rows/columns (in the sequel, a row/column always
means a block row/column) are assigned a correction capability of t + 1, whereas the
remaining 2p − θ inner block rows/columns are assigned a correction capability of t. In this
case, due to the uneven distribution of eBCH parities, it is necessary to cross-check the validity
of (14), such that
$$pb + (t + 1)m + 1 < 2^m. \tag{17}$$
Figure 1 illustrates an example of the above BWP-BCH code construction, wherein the last
partial data block is padded with zeros and a single-parity RS code is used.
The second case is such that
$$p^2 < \eta \le p(p+1). \tag{18}$$
Then the inner RS code is arranged in a $p \times (p+1)$ matrix. There are $2p+1$ eBCH words, each allocated on average $\lceil (R-fb)/(2p+1) \rceil$ parity bits. Accordingly, the eBCH code field dimension is determined by
$$m = \left\lceil \log_2\!\left( (p+1)b + \left\lceil \frac{R-fb}{2p+1} \right\rceil \right) \right\rceil, \tag{19}$$
and the base correction capability is given by
$$t = \left\lfloor \frac{R-fb-(2p+1)}{(2p+1)m} \right\rfloor. \tag{20}$$
The residual correction power is determined by
$$\theta = \left\lfloor \frac{R-fb-(2p+1)}{m} \right\rfloor - (2p+1)t. \tag{21}$$
Likewise, the $\theta$ longest inner block rows/columns are assigned $t+1$ correction capability, whereas the remaining $2p+1-\theta$ inner block rows/columns are assigned $t$ correction capability. When $\theta > 0$, it is necessary to validate the field dimension $m$ such that
$$(p+1)b + (t+1)m + 1 < 2^m. \tag{22}$$
Figure 2 illustrates an example of the above BWP-BCH code construction.
Clearly, the foregoing coding configuration is fully determined by two parameters, $b$ and $f$. The block size $b$ is chosen to optimize the waterfall performance, whereas $f$ is mainly associated with the error floor. A larger $f$ yields a lower error floor, at the price of a rate penalty. Since the inner blocks are arranged in a (near) square shape, the dominant error event is such that an equal number of rows and columns are uncorrectable, yielding a square number of intersecting blocks. For this reason, it is preferred to choose $f$ to be a square number, say 1 or 4. Our simulations show that $f = 4$ achieves a good balance between a low error floor and superior waterfall performance. The failure probability of $i$ row failures and $j$ column failures has been extensively investigated in the literature ([1, 2, 3]). Our case also needs to take into account the different correction capabilities among rows (columns).
Consider a data storage example wherein the data length is 4K bytes, i.e., $K = 32768$, and the parity length is 455 bytes, i.e., $R = 3640$. The code rate is 0.9. Assume the block size is $b = 32$ and the RS parity length is $f = 4$. The number of inner code blocks is then $\lceil K/b \rceil + f = 1024 + 4 = 1028$. Accordingly, it belongs to case 2, and results in $p = 32$. The inner RS code is organized into 32 rows by 33 columns, wherein the last column has only 4 blocks. The resulting eBCH codes are defined over the field dimension of $m = 11$.

Table 1: BWP-BCH mapping for (K = 32768, R = 3640, b = 32, f = 4).

Rows/Columns    Inner Blocks    eBCH t
4 rows          33              5
28 rows         32              5
21 columns      32              5
10 columns      32              4
1 column        4               4

Table 2: BWP-BCH mapping for (K = 32768, R = 3640, b = 15, f = 4).

Rows/Columns    Inner Blocks    eBCH t
27 rows         47              4
20 rows         46              4
19 columns      47              4
27 columns      47              3
1 column        24              3

The base correction capability is given by $t = \lfloor (3640-193)/(65 \times 11) \rfloor = 4$. The residual correction power is determined by $\theta = \lfloor (3640-193)/11 \rfloor - 65 \times 4 = 53$. We obtain the complete BWP-BCH configuration as in Table 1.
Consider a different block size $b = 15$ while keeping $f = 4$. The number of inner blocks is now 2189. Accordingly, the inner RS code is organized into a 47 by 47 block matrix, wherein the last column has 27 blocks. The resulting eBCH codes are defined over a 10-bit field, with the base correction capability $t = \lfloor (3640-154)/(2 \times 47 \times 10) \rfloor = 3$. The residual correction power is $\theta = \lfloor (3640-154)/10 \rfloor - 94 \times 3 = 66$. Table 2 shows the detailed BWP-BCH configuration.
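The two worked examples can be reproduced mechanically from (11) through (22). The sketch below is a minimal Python illustration under the assumption that the rate is high enough for Lemma 1 to apply, so that $t$ follows from (15) or (20) rather than from a computer search. The names `BwpConfig` and `bwp_config` are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class BwpConfig:
    eta: int        # number of inner RS blocks
    p: int          # solution of (11)
    case: int       # 1: p x p array, 2: p x (p+1) array
    cols: int       # number of block columns
    m: int          # eBCH field dimension, (14) or (19)
    t: int          # base correction capability, (15) or (20)
    theta: int      # residual correction power, (16) or (21)
    field_ok: bool  # validity check (17) or (22)

def ceil_div(a: int, b: int) -> int:
    return -(-a // b)

def bwp_config(K: int, R: int, b: int, f: int) -> BwpConfig:
    eta = ceil_div(K, b) + f
    p = math.isqrt(eta)
    if p * (p + 1) < eta:
        p += 1
    case = 1 if eta <= p * p else 2
    cols = p if case == 1 else p + 1
    words = p + cols                      # 2p or 2p+1 eBCH words
    budget = R - f * b                    # parity bits left for eBCH
    # ceil(log2(x)) for a positive integer x equals (x - 1).bit_length()
    m = (cols * b + ceil_div(budget, words) - 1).bit_length()
    t = (budget - words) // (words * m)
    theta = (budget - words) // m - words * t
    field_ok = cols * b + (t + 1) * m + 1 < 2 ** m
    return BwpConfig(eta, p, case, cols, m, t, theta, field_ok)

if __name__ == "__main__":
    print(bwp_config(32768, 3640, 32, 4))
    print(bwp_config(32768, 3640, 15, 4))
```

The sketch stops at $(m, t, \theta)$; assigning the $\theta$ extra corrections to the longest rows/columns, as listed in Tables 1 and 2, is left out.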
Note that one condition must be satisfied to enforce nontrivial RS coding:
$$2^b \ge \eta \triangleq \left\lceil \frac{K}{b} \right\rceil + f, \tag{23}$$
where $f > 1$ denotes the number of RS parity blocks and $\eta$ denotes the number of inner RS blocks (for short, inner blocks). However, this enforcement is not needed if $f = 1$, i.e., in the case of trivial parity coding. For implementation simplicity, we choose its generator polynomial
$$g_{RS}(x) = (x-1)(x-\beta)\cdots(x-\beta^{f-1}) \tag{24}$$
where $\beta$ denotes a primitive element of the RS operation field. Its erasure-only decoding effectively recovers up to $f$ erased symbols, as briefly described below (cf. [10]). Upon receiving a senseword $y(x)$, its syndromes are computed as follows:
$$\hat{S}_i = y(\beta^i), \quad i = 0, 1, \ldots, f-1, \tag{25}$$
wherein the notation $\hat{S}$ is used to differentiate from the eBCH syndromes $S$. Let $X_0, X_1, \ldots, X_{e-1}$ ($e < f$) be the (known) erasure locators. Then the erasure locator polynomial is given by
$$\hat{\Lambda}(x) \triangleq (1-X_0x)(1-X_1x)\cdots(1-X_{e-1}x) \tag{26}$$
and the erasure evaluator polynomial by
$$\hat{\Omega}(x) \triangleq \hat{\Lambda}(x)\hat{S}(x) \pmod{x^f}. \tag{27}$$
Erasure-only decoding is successful if and only if
$$\deg(\hat{\Omega}(x)) < e. \tag{28}$$
If true, then the corresponding erasure values, $\{Y_i\}_{i=0}^{e-1}$, are retrieved by
$$Y_i = \frac{\hat{\Omega}(X_i^{-1})}{\hat{\Lambda}_{odd}(X_i^{-1})}, \quad i = 0, 1, \ldots, e-1, \tag{29}$$
where $\hat{\Lambda}_{odd}(x)$ denotes the odd term polynomial of $\hat{\Lambda}(x)$.

[Figure 3: Encoder block diagram for BWP-BCH codes. Blocks shown: data block input, RS encoders, two b-bit/clock column BCH encoders, and row BCH encoders 0 to p−1 running at b/p bits/clock, each preceded by a block buffer.]
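The erasure-only decoding steps (25) through (29) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses a toy field GF($2^4$) with primitive polynomial $x^4+x+1$, takes the locator of position $k$ to be $X = \beta^k$, and represents polynomials as coefficient lists in ascending order. All function names are hypothetical.

```python
# Toy field GF(2^4), primitive polynomial x^4 + x + 1 (an assumption for illustration).
N = 15
EXP, LOG = [], {}
_x = 1
for _i in range(N):
    EXP.append(_x)
    LOG[_x] = _i
    _x <<= 1
    if _x & 0b10000:
        _x ^= 0b10011

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % N]

def ginv(a):
    return EXP[(N - LOG[a]) % N]

def poly_eval(poly, x):
    acc = 0
    for c in reversed(poly):          # Horner's rule
        acc = gmul(acc, x) ^ c
    return acc

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gmul(ai, bj)
    return out

def erasure_only_decode(y, erased_positions, f):
    """Return the erasure values Y_i, or None if the check (28) fails."""
    S = [poly_eval(y, EXP[i % N]) for i in range(f)]      # (25)
    X = [EXP[k % N] for k in erased_positions]
    lam = [1]
    for Xk in X:                                          # (26); minus equals plus in GF(2^m)
        lam = poly_mul(lam, [1, Xk])
    omega = poly_mul(lam, S)[:f]                          # (27)
    e = len(X)
    if any(omega[e:]):                                    # (28): deg(omega) < e
        return None
    lam_odd = [c if k % 2 else 0 for k, c in enumerate(lam)]
    values = []
    for Xk in X:                                          # (29)
        xi = ginv(Xk)
        values.append(gmul(poly_eval(omega, xi), ginv(poly_eval(lam_odd, xi))))
    return values

if __name__ == "__main__":
    y = [0] * N                      # all-zero codeword with three corrupted symbols
    y[2], y[5], y[9] = 7, 1, 12
    print(erasure_only_decode(y, [2, 5, 9], f=4))
```

In the usage example the transmitted codeword is all-zero, so the returned erasure values equal the corrupted symbols themselves. An error outside the declared erasure set makes the degree check (28) fail.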
In some cases (23) is not met, and RS coding with $f > 1$ is then infeasible. We next consider relaxing (23). Let $[D_{i,j}]$ be the $p \times (p+1)$ data block array, wherein empty array blocks are treated as zero blocks. The RS data vector, $[\hat{D}_0, \hat{D}_1, \hat{D}_2, \ldots, \hat{D}_{2p-2}]$, is produced by XORing $[D_{i,j}]$ along the reverse diagonals, i.e.,
$$\hat{D}_i \triangleq \bigoplus_{l+j=i} D_{l,j}, \quad i = 0, 1, \ldots, 2p-2, \tag{30}$$
where $\oplus$ denotes bit-wise XOR. Note that $\hat{D}_{2p-1}$ is not defined, as $D_{p-1,p}$ is reserved for RS parity. Accordingly, (23) is relaxed to
$$2^b > 2p - 1 + f. \tag{31}$$
It is worth noting that employing (30) also renders a simpler implementation. This is because a block can be partitioned into multiple sub-blocks such that each of them is separately protected by an RS code defined over a small operation field. Clearly, failed blocks in single-column plus multiple-row or single-row plus multiple-column patterns belong to different RS symbols, and thus can be uniquely recovered. However, it may not fully work for $f = 4$. This is because, when a two-row plus two-column failure occurs, two failed blocks may lie on the same reverse diagonal and thus belong to the same RS symbol. In this case, the remaining two failed blocks must belong to different RS symbols and thus are uniquely recovered. Consequently, the remaining two (uncorrelated) blocks are predominantly corrected by decoding row-wise or column-wise. For implementation simplicity, it is desirable to use the above method when $f \le 4$.
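The reverse-diagonal XOR of (30) can be illustrated with a short Python sketch. Blocks are modeled as Python integers holding $b$ bits each, and the function names are hypothetical. The second helper shows why a two-row plus two-column failure can place two failed blocks in the same RS symbol.

```python
def reverse_diagonal_xor(D):
    """Compute D_hat[i] = XOR of D[l][j] over l + j == i, for i = 0..2p-2, as in (30).

    D is a p x (p+1) array of integer-coded blocks; the block D[p-1][p]
    (index sum 2p-1) is excluded because it is reserved for RS parity.
    """
    p = len(D)
    if any(len(row) != p + 1 for row in D):
        raise ValueError("expected a p x (p+1) block array")
    d_hat = [0] * (2 * p - 1)
    for l in range(p):
        for j in range(p + 1):
            i = l + j
            if i <= 2 * p - 2:
                d_hat[i] ^= D[l][j]
    return d_hat

def rs_symbol_index(l, j):
    """RS symbol (reverse diagonal) to which block (l, j) contributes."""
    return l + j

if __name__ == "__main__":
    D = [[1, 2, 4],
         [8, 16, 32]]            # p = 2; D[1][2] is the reserved parity position
    print(reverse_diagonal_xor(D))
    # Rows {0, 1} and columns {0, 1} fail: blocks (0, 1) and (1, 0) collide.
    failed = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print([rs_symbol_index(l, j) for l, j in failed])
```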
We next describe an efficient high-speed encoder. First note that it is common for the block size to be much greater than the required RS symbol size, i.e., $b \gg \log_2(\eta)$. Instead of treating a block as an RS symbol, we divide a block into multiple RS symbols and encode each to a separate RS code. This dramatically reduces the circuit complexity of the finite field multiplier and divider. As in the above example, 10-bit RS coding suffices for $b = 40$. Thus, it suffices to partition each block into 4 RS symbols and to encode them into 4 RS codes, respectively. Secondly, assuming a column-wise block of data is transferred each clock, theoretically we may use one high-speed BCH encoder (cf. [8, 9]) to process $b$ bits in a clock. However, it is difficult to offload the parity and immediately switch to the next column encoding (with proper register initializations). To this end, we use two column encoders that ping-pong for the task. On the other hand, a row encoder only needs to process a block of data upon transferring $p$ column-wise blocks of data. Our solution is to add a block buffer in front of each row encoder and design a low-throughput encoder that processes only $\lceil b/p \rceil$ bits per cycle (the eBCH encoder is halted if it completes in fewer than $p$ cycles). As far as encoding a block is concerned, the row and column eBCH encoders, as well as the RS encoders, follow the same first-in-first-out bit sequence. Figure 3 depicts the block diagram of the proposed $b$-bits/clock encoder. Note that the proposed encoder applies to any inner block length within $\big(p(p-1),\, p(p+1)\big]$, and at the same time accommodates different eBCH parities. It is worth pointing out that enforcing a single BCH field allows the BCH encoder/decoder to be shared effectively.
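The block-to-symbol partition can be sketched as below. This is a hedged illustration: it assumes the RS symbol width is $\lceil \log_2 \eta \rceil$ as stated for the minimum RS field dimension, and it assumes $K = 32768$ and $f = 4$ for the $b = 40$ case, which the text does not state explicitly. The function names and the bit ordering of the split are hypothetical choices.

```python
def rs_symbol_split(K: int, b: int, f: int):
    """Return (symbol width in bits, number of parallel RS codes per block)."""
    eta = -(-K // b) + f                  # number of inner blocks
    width = (eta - 1).bit_length()        # ceil(log2(eta))
    n_codes = -(-b // width)              # ceil(b / width) sub-symbols per block
    return width, n_codes

def split_block(block: int, b: int, width: int):
    """Split a b-bit block (as an integer) into width-bit RS symbols, low bits first."""
    mask = (1 << width) - 1
    return [(block >> shift) & mask for shift in range(0, b, width)]

def join_block(symbols, width: int) -> int:
    """Inverse of split_block."""
    block = 0
    for k, s in enumerate(symbols):
        block |= s << (k * width)
    return block

if __name__ == "__main__":
    width, n_codes = rs_symbol_split(32768, 40, 4)
    print(width, n_codes)
    print(split_block(0x123456789A, 40, width))
```

Each of the sub-symbol streams would then feed its own small-field RS encoder, which is what keeps the multiplier and divider narrow.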
IV. New Iterative Hard Decoding of BWP-BCH Codes
Apparently, all existing BWP-BCH decoding algorithms are applicable (possibly with minor
modification) to the proposed codes. In this section, we explore more efficient hard decoding
of the proposed BWP-BCH codes. We aim to lower error floor by effectively handling
three types of dominant error events. The first one is excessive errors in a few data blocks
which causes both row and column decoding failure. The second one is excessive errors in
a few eBCH parities, which are not cross protected. The last one is miscorrection of eBCH
constituents, due to small minimum distance. We also aim to boost waterfall performance
through intelligently incorporating the proposed extra-1-bit and extra-2-bit list decoding
algorithms. We present a novel iterative decoding algorithm for the proposed BWP-BCH
codes, in the following three phases.
I. Iteratively alternate row and column reduced-1-bit decoding until the process stalls or
a pre-determined maximum number of iterations is reached.
II. Iteratively alternate row and column regular decoding until the process stalls or a
pre-determined maximum number of iterations is reached.
III. Iteratively alternate row and column list decoding up to extra-2-bit errors until the
process stalls or a pre-determined maximum number of iterations is reached.
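The control flow shared by the three phases can be sketched as a generic loop. This is a minimal Python illustration, not the authors' implementation: the constituent decoders are abstracted as hypothetical callables that perform one half-iteration and return the number of still-failed rows (columns), and a stall is taken to mean that both counts are unchanged over a full iteration. The per-half-iteration success check and RS erasure decoding are deliberately left out.

```python
def run_phase(decode_rows, decode_cols, failed_rows, failed_cols, max_iters):
    """Alternate row and column decoding until stall, completion, or max_iters.

    decode_rows / decode_cols: callables running one half-iteration and
    returning the number of rows / columns that still fail afterwards.
    Returns (failed_rows, failed_cols, iterations_used).
    """
    for iteration in range(1, max_iters + 1):
        new_rows = decode_rows()
        new_cols = decode_cols()
        stalled = (new_rows, new_cols) == (failed_rows, failed_cols)
        failed_rows, failed_cols = new_rows, new_cols
        if stalled or (failed_rows == 0 and failed_cols == 0):
            return failed_rows, failed_cols, iteration
    return failed_rows, failed_cols, max_iters

def run_all_phases(phases, failed_rows, failed_cols, max_iters):
    """phases: list of (decode_rows, decode_cols) pairs for Phase I, II, III."""
    for decode_rows, decode_cols in phases:
        failed_rows, failed_cols, _ = run_phase(
            decode_rows, decode_cols, failed_rows, failed_cols, max_iters)
        if failed_rows == 0 and failed_cols == 0:
            break
    return failed_rows, failed_cols

if __name__ == "__main__":
    rows = iter([5, 3, 3])       # toy failure counts after each row half-iteration
    cols = iter([4, 2, 2])
    print(run_phase(lambda: next(rows), lambda: next(cols), 6, 6, max_iters=10))
```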
The underlying purpose of Phase-I is to reduce the miscorrection rate so as to avoid error amplification. We next dive into the implementation details. We call the decoding stalled if the numbers of failed rows and columns remain unchanged over a full iteration. Upon decoding stalling, let the number of failed intersecting blocks be the product of the number of failed rows and the number of failed columns. Upon the completion of each half-iteration, i.e., row-wise or column-wise eBCH decoding, the decoding status is checked; the process is terminated early if a decoding success is declared. Herein a decoding success is declared when the number of failed intersecting blocks is at most $f$ and RS erasure-only decoding is successful. We give the following examples to clarify this criterion.
• If all row (column) eBCH constituents are successfully decoded but not all column
(row) eBCH constituents, and RS syndromes are zeros, then, the number of failed
intersecting blocks is zero and thus is declared success even without inner RS coding,
wherein failed constituents can be simply corrected by re-encoding.
• If all row (column) eBCH constituents are successfully decoded but not all column
(row) eBCH constituents, and RS syndromes are non-zeros, then, it is proclaimed
unsuccessful.
• If the number of failed intersecting blocks is less than f , then, it is declared success
only if erasure-only decoding is successful, but not erasure-and-error decoding. When
erasure-only decoding is successful, re-encoding is deployed to correct eBCH parities.
• If the number of failed intersecting blocks is greater than f , then declare failure.
• If all eBCH constituents are successfully decoded but RS syndromes are non-zeros,
then it is proclaimed unsuccessful. This is because our designing purpose of inner RS
coding is for erasure recovery, whereas random error correction may result in intractable
decoding behaviors.
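The success criterion illustrated by the examples above can be condensed into a small decision function. The Python sketch below is one reading of that criterion, with hypothetical names; it treats the boundary case of exactly $f$ failed intersecting blocks as eligible for erasure-only decoding, following the "up to $f$" wording of the definition.

```python
def decoding_success(failed_rows: int, failed_cols: int, f: int,
                     rs_syndromes_zero: bool, erasure_only_ok: bool) -> bool:
    """Decide whether decoding is declared successful after a half-iteration.

    failed_rows, failed_cols: numbers of failed row / column eBCH constituents.
    rs_syndromes_zero: True if all RS syndromes are zero.
    erasure_only_ok: outcome of RS erasure-only decoding on the failed
                     intersecting blocks (only consulted when relevant).
    """
    intersecting = failed_rows * failed_cols
    if intersecting == 0:
        # No block to erase: the RS code acts purely as a check, and
        # failed constituents can be repaired by re-encoding.
        return rs_syndromes_zero
    if intersecting > f:
        return False
    # Erasure-only decoding; erasure-and-error decoding is not attempted.
    return erasure_only_ok

if __name__ == "__main__":
    print(decoding_success(0, 3, 4, True, False))    # rows clean, RS check passes
    print(decoding_success(0, 0, 4, False, False))   # all decoded but RS check fails
    print(decoding_success(2, 2, 4, False, True))    # 4 erasures recovered
    print(decoding_success(2, 3, 4, False, True))    # 6 > f intersecting blocks
```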
We next present the details of syndrome computation and update. While the senseword is received sequentially, the (single-bit) parity syndrome and the even-indexed syndromes of each eBCH code (note the odd-indexed syndromes are not saved but computed on the fly through (7)) and the RS syndromes are computed simultaneously. Each time an eBCH constituent is successfully decoded, corrections are immediately made to the senseword, and the syndromes of the crossing eBCH constituents and the RS code(s) are updated accordingly. When bits at indexes $\{i_l\}_{l=0}^{\iota-1}$ of a $t$-correcting eBCH constituent are corrected by the decoding of crossing eBCH constituents, its syndromes are updated as follows:
$$S_j \leftarrow S_j + \sum_{l=0}^{\iota-1} \alpha^{(j+1) i_l}, \quad j = 0, 2, \ldots, 2t-2. \tag{32}$$
It is worth noting that the parity syndrome is used to eliminate unnecessary decodings. Specifically, if the parity syndrome plus the targeted number of corrections is an odd number, then the corresponding decoding is deemed unsuccessful.
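The incremental update (32) and the parity-based pruning rule can be checked with a small Python sketch. It is an illustration under assumptions: a toy field GF($2^4$) with primitive polynomial $x^4+x+1$ stands in for the actual eBCH field, the syndromes are taken as $S_j = y(\alpha^{j+1})$, and all function names are hypothetical.

```python
# Toy field GF(2^4), primitive polynomial x^4 + x + 1 (an assumption for illustration).
M, N = 4, 15
EXP = []
_x = 1
for _ in range(N):
    EXP.append(_x)
    _x <<= 1
    if _x & (1 << M):
        _x ^= 0b10011

def alpha_pow(e: int) -> int:
    return EXP[e % N]

def even_syndromes(bits, t):
    """Compute S_j = sum over set bits i of alpha^((j+1)*i), for j = 0, 2, ..., 2t-2."""
    S = {}
    for j in range(0, 2 * t, 2):
        acc = 0
        for i, bit in enumerate(bits):
            if bit:
                acc ^= alpha_pow((j + 1) * i)
        S[j] = acc
    return S

def update_syndromes(S, flipped_indexes):
    """Apply (32) in place after crossing constituents flipped the given bit indexes."""
    for j in S:
        for i in flipped_indexes:
            S[j] ^= alpha_pow((j + 1) * i)

def parity_allows(parity_syndrome: int, num_corrections: int) -> bool:
    """A decoding that flips num_corrections bits must restore even parity."""
    return (parity_syndrome + num_corrections) % 2 == 0

if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
    S = even_syndromes(bits, t=2)
    for i in (3, 7):
        bits[i] ^= 1
    update_syndromes(S, [3, 7])
    print(S == even_syndromes(bits, t=2))
```

The update touches only the flipped indexes, so its cost scales with the number of corrections rather than with the constituent length.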
In Phase-II, the decoding is limited to failed rows/columns. Upon stalling, if the number of intersecting blocks among failed eBCH words is at most $f$, then RS erasure decoding is called to recover those blocks and subsequently the senseword is corrected. We remark that the proposed Phase-I and Phase-II decoding is motivated by [15]. A minor difference is that reduced-1-bit iterative decoding is run until stalling in Phase-I, as opposed to being limited to the first iteration in [15]. In addition, the proposed reduced-1-bit decoding algorithm effectively rules out unfruitful Chien searches.
Phase-III differs from the previous two phases, as each eBCH list decoding may produce multiple candidates. To reduce the number of candidates as well as the search complexity, the evaluation of $\Delta(x)$, as defined in (9), is limited to the failed intersecting blocks and the parity block. One trivial solution is to randomly pick a candidate. However, this suffers greatly from miscorrection due to the high probability of picking a wrong candidate.
Herein we present an alternative approach. Upon successfully determining a list of candidates from a failed row (column) constituent word, we perform a trial correction on each candidate. Each time, we check whether the crossing column (row) Berlekamp decoding of any previously failed word is successful. We choose the candidate that results in the largest number of crossing column (row) corrections and make both row and column corrections accordingly. We discard all candidates if none results in a crossing column (row) correction. Then the next row (column) is processed in the same manner. This approach effectively takes advantage of crossing validation. Also note that this trial-and-error approach is performed in the last phase, wherein only a few rows and columns remain to be corrected, so the complexity increment is at most moderate.
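The candidate selection rule can be sketched as follows. This is a minimal Python illustration with hypothetical names: the trial correction and the crossing decodings are hidden behind a caller-supplied scoring function, and ties are broken in favor of the first candidate, which is an assumption the text does not settle.

```python
def pick_candidate(candidates, count_cross_successes):
    """Pick the list-decoding candidate validated by the most crossing decodings.

    candidates: iterable of candidate error patterns for one failed row (column).
    count_cross_successes: callable that applies a trial correction for one
        candidate and returns how many previously failed crossing columns
        (rows) then decode successfully.
    Returns (best_candidate, score), or (None, 0) if no candidate enables
    any crossing correction, in which case all candidates are discarded.
    """
    best, best_score = None, 0
    for cand in candidates:
        score = count_cross_successes(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

if __name__ == "__main__":
    # Toy scores: candidate error patterns given as tuples of bit positions.
    toy_scores = {(3, 17, 40): 0, (3, 18, 41): 2, (5, 18, 41): 1}
    print(pick_candidate(toy_scores.keys(), toy_scores.get))
    print(pick_candidate([(1, 2)], lambda cand: 0))
```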
Two indicator vectors are exploited to ease implementation, namely, the correction indicator vector and the syndrome update indicator vector. The correction indicator vector tracks the correction status. When an eBCH constituent is corrected, its syndromes are reset to zeros, while setting its corresponding indicator to 1 is delayed for a whole iteration (the rationale is to exploit cross decoding validation to reduce miscorrection). The proposed iterative decoding skips a constituent eBCH word if its correction indicator is 1. Moreover, the evaluation of $\Delta(x)$ in Phase-III skips blocks that were corrected in earlier iterations, i.e., whose correction indicators are 1. The syndrome update indicator vector keeps track of syndrome updates. When a correction is made to an eBCH constituent by crossing eBCH constituents, its syndromes are updated accordingly and the corresponding indicator is set to 1. The proposed iterative decoding skips a constituent eBCH word if its syndrome update indicator is 0. When the decoding of an eBCH constituent is done (regardless of failure or success), its syndrome update indicator is reset to zero. At the beginning of each phase, the syndrome update indicators corresponding to all rows/columns are initialized to 1.

Table 3: Number of parity bits in simulated codes with respect to the fixed data size of K = 32768. Right column shows BCH correcting power (t).

Rate     BWP parity bits    BCH parity bits    t (BCH)
0.889    4082               4088               258
0.9      3634               3640               228
0.93     2463               2472               155

V. Performance Evaluation
We simulate the proposed codes and decoding algorithms to evaluate their effectiveness. The
size of the user data used for the simulation is 4kB (K = 32768). We evaluate code rates of
0.889, 0.9 and 0.93. For each code rate, we simulate decoding with no list decoding, with up
to extra-1-bit list decoding, and with up to extra-2-bit list decoding. We further compare
with stand-alone BCH codes of the same rates and lengths. The exact numbers of parity
bits are specified in Table 3.
We use an inner RS code with $f = 4$ parities to reduce the error floor. The main additional free parameter is the block size $b$. For each code rate, we simulate various block sizes to identify one with good performance. An accurate optimization of the block size would not be practical, since simulating low frame error rates takes a long time. Further, we observe that the block size that performs best for a given raw bit error rate (RBER) and a given decoding algorithm also tends to perform (near) best for slightly different RBERs and slightly different algorithms. To this end, we settle for a well-performing block size for each given code, instead of separately optimizing for each RBER point and each decoding algorithm. The block sizes we find to perform well are $b = 20$ for rate 0.889, $b = 15$ for rate 0.9, and $b = 50$ for rate 0.93. The simulation results are presented in Figure 4. The maximum number of turbo iterations is set to 32. Each simulation runs until 100 BWP-BCH frame failures are observed.
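Since the results are reported against both RBER and SNR, a conversion between the two axes is useful when reading them. The sketch below is an assumption rather than something stated in the text: it models hard-decision binary signaling over an AWGN channel, for which $\mathrm{RBER} = Q(\sqrt{\mathrm{SNR}})$ with $\mathrm{SNR} = 2E_b/N_0$ in linear scale. This model reproduces the approximate correspondence between the RBER and SNR axis ranges of Figure 4, but the channel model actually used in the simulations is not specified here.

```python
import math

def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def rber_from_snr_db(snr_db: float) -> float:
    """Assumed mapping: RBER = Q(sqrt(SNR)), with SNR = 2*Eb/N0 given in dB."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return q_function(math.sqrt(snr_linear))

if __name__ == "__main__":
    for snr_db in (7.6, 8.0, 8.4, 8.8, 9.2):
        print(f"SNR = {snr_db:.1f} dB  ->  RBER = {rber_from_snr_db(snr_db):.2e}")
```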
The simulation results provide several insights:
1. List decoding is significantly superior to unique decoding in this setting. The error floor of unique decoding is very high, while list decoding reduces the error floor below the observable zone.
2. At the lower code rates, BWP-BCH codes provide significant gain over stand-alone BCH codes, with 0.4 dB for rate 0.889. While the gain almost entirely disappears at rate 0.93, it is known that other iteratively decoded codes, such as LDPC codes of similar length, also do not outperform BCH codes at this rate (in hard decoding).
3. Incorporating +2 list decoding provides a very small gain over that of +1 list decoding. However, we show next that incorporating +2 list decoding is beneficial for reducing the error floor.

[Figure 4: Simulation results for 4 RS parities. Panels: FER versus RBER and FER versus SNR (dB). Curves: stand-alone BCH and BWP-BCH decoding with t, t+1, and t+2, for R = 0.889, 0.9, and 0.93. R is the code rate, FER is the frame error rate, RBER is the raw bit error rate, and SNR is the signal to noise ratio (2Eb/N0).]

[Figure 5: Benefit of +2 list decoding in error-floor region. The error-floor plots are numerical estimates, wherein code rate 0.9, 1 RS parity, block size b = 20. Axes: FER versus SNR. Curves: BCH, capacity, and t, t+1, t+2, each with its error floor estimate.]

[Figure 6: Comparison of 1 vs 4 RS parities with unique decoding at code rate 0.9 and block size b = 15. Axes: FER versus SNR.]

[Figure 7: Mean number of iterations under code rate 0.9, 1 RS parity and block size 20. Axes: number of iterations versus RBER. Curves: t, and t+1 and t+2.]
To show the benefit of incorporating +2 list decoding over that of +1 list decoding, we use a numerical method to estimate the error floor. The method is straightforward, as described in [1]. We evaluate the code at rate 0.9, with a single RS parity and a block size of 20. Figure 5 shows simulation results along with error floor estimates. The error floor estimate proves to be quite accurate for radius $t$ and $t+1$. The figure suggests that +2 list decoding improves the error floor by 2.5 orders of magnitude over $t+1$. On another note, at the benchmark frame error rate of 1e−6, the proposed iterative decoding algorithm is shown to have a 1 dB gap from the Shannon capacity, while gaining 0.35 dB over the BCH performance.
Note that the error-floor estimation method from [1] does not consider list decoding. To use this method with list decoding (+1 and +2 list decoding), we assume that the list decoding always returns only the unique (correct) codeword. In other words, we use the extended decoding radius as if it were the actual decoding radius of the BCH component codes. The accuracy of the assumption is supported by the relative accuracy of the estimates for $t$ and $t+1$ decoding.
It remains to explain why the +2 list decoding improvement is so insignificant in the waterfall region. An investigation of the failure scenarios reveals the reason. In most cases when +1 list decoding gets stuck, the number of non-zero BCH syndromes is rather high. In such cases, miscorrections often happen in the BCH decoding, since the hint from the opposite dimension is not so helpful. As a result, the BWP decoding is unable to correct those BCH sensewords.
In contrast, in the error-floor region, the number of non-zero BCH syndromes is small, resulting in a good hint for the list decoder. Therefore, +2 list decoding resolves most $t+1$ failures. Note that the reasoning above is in principle also valid for the gain of $t+1$ decoding over $t$ decoding in the waterfall region. However, we do see a significant gain from +1 list decoding in the waterfall region. The reason for this behavior is that the lists of radius $t+1$ are almost always much smaller than those of radius $t+2$ for the values of $t$ used in the considered simulations (around 5).
Finally, note that the code in Figure 5 contains a single RS parity, while the codes in Figure 4 contain 4 parities. It is instructive to consider the trade-off that governs the choice of the number of RS parities. Intuitively, we would expect a code with fewer RS parities to perform better in the waterfall region, since the saved parities are used to strengthen the BCH codes. On the other hand, a code with more RS parities would perform better in the error-floor region, since the additional RS parities would protect better against errors concentrated in a small number of blocks. The choice of the number of RS parities should be made according to the desired error-floor level. We illustrate the trade-off by the simulation results in Figure 6, at code rate 0.9 and block size $b = 15$, with unique decoding. Furthermore, the mean number of iterations is presented in Figure 7.
VI. Concluding Remarks
The paper studies BWP-BCH codes with a focus on flash memory applications. Firstly, novel efficient BCH decoding algorithms are presented, including −1 decoding, +1 list decoding, and +2 list decoding. Secondly, a unified construction framework of BWP-BCH codes is presented, leveraging many design freedoms to balance design scalability, implementation simplicity, and superior performance. A high-speed scalable encoder is described. Finally, a novel iterative decoding algorithm for BWP-BCH codes is presented, which utilizes the proposed BCH decoding algorithms to optimize decoding performance. Simulation results demonstrate superior waterfall performance and a significantly lowered error floor. Notably, it achieves a 1 dB gap from capacity under the benchmark of 4kB data size, code rate 0.9, and frame error rate 1e−6.
There are many problems to be explored. One interesting problem is to build soft information into the proposed iterative decoding. One would wonder how much further gain may be achieved. Another interesting problem is to extend the proposed decoding algorithm to soft-input soft-output decoding. Furthermore, it is easily observed that concatenating an inner RS code significantly improves the minimum distance bound. However, in our view, it is still too loose to be meaningful in a straightforward manner. Finally, from the hardware perspective, the proposed +1 list decoding employs dynamic grouping and rational function evaluation, which are overly complex. A simplified hardware implementation would certainly render the proposed iterative decoding more practical.
References
[1] S.-G. Cho, D. Kim, J. Choi, and J. Ha, “Block-wise concatenated BCH codes for NAND
flash memories,” IEEE Trans. Commun., vol. 62, no. 4, pp. 1164–1177, Apr. 2014.
[2] G. Yu and J. Moon, “Concatenated Raptor codes in NAND flash memory,” IEEE J.
Selected Areas in Commun., vol. 32, no. 5, pp. 857–869, May 2014.
[3] D. Kim and J. Ha, “Quasi-primitive block-wise concatenated BCH codes with collabo-
rative decoding for NAND flash memories,” IEEE Trans. Commun., vol. 63, no. 10, pp.
3482–3496, Oct. 2015.
[4] D. Kim and J. Ha, “Serial quasi-primitive BC-BCH codes for NAND flash memories,”
Proc. IEEE Int. Conf. Commun. (ICC), pp. 1–6, May 2016.
[5] R. Pyndiah, “Near-optimum decoding of product codes: block turbo codes,” IEEE
Trans. Commun., vol. 46, no. 8, pp. 1003–1010, Aug. 1998.
[6] ITU-T, G.975.1: Forward error correction for high bit-rate DWDM submarine systems.
Available: https://www.itu.int/rec/T-REC-G.975.1/en.
[7] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting Codes, Ams-
terdam: North-Holland, 1977.
[8] T.-B. Pei and C. Zukowski, “High-speed parallel CRC circuits in VLSI,” IEEE Trans.
Commun., vol. 40, no. 4, pp. 653–657, Apr. 1992.
[9] X. Zhang and K. K. Parhi, “High-speed architectures for parallel long BCH encoders,”
IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 13, no. 7, pp. 872–877,
July 2005.
[10] R. E. Blahut, Algebraic Codes for Data Transmission, Cambridge: Cambridge Univer-
sity Press, 2003.
[11] Y. Wu, “New list decoding algorithms for Reed-Solomon and BCH codes,” IEEE Trans.
Inf. Theory, vol. 54, pp. 3611–3630, Aug. 2008.
[12] Y. Wu, “Fast Chase decoding algorithms and architectures for Reed–Solomon codes,”
IEEE Trans. Inf. Theory, vol. 58, pp. 109–129, Jan. 2012.
[13] S. Egorov, G. Markarian, and K. Pickavance, “A modified Blahut algorithm for decoding
Reed–Solomon codes beyond half the minimum distance,” IEEE Trans. Commun., vol.
52, no. 12, pp. 2052–2056, Dec. 2004.
[14] Z. Wang, “Super-FEC codes for 40/100 Gbps networking,” IEEE Commun. Letters, vol.
16, no. 12, pp. 2056–2059, Dec. 2012.
[15] A. J. Al-Dweik and B. S. Sharif, “Non-sequential decoding algorithm for hard iterative
turbo product codes,” IEEE Trans. Commun., vol. 57, no. 6, pp. 1545–1549, June 2009.
[16] G. Chen, L. Cao, L. Yu, and C. Chen, “Test-pattern-reduced decoding for turbo product
codes with multi-error-correcting eBCH codes,” IEEE Trans. Commun., vol. 57, no. 2,
pp. 307–310, Feb. 2009.
