Optimization of Multi-Channel BCH Error Decoding for Common Cases by Dill, Russell (Author) et al.
Optimization of Multi-Channel BCH
Error Decoding for Common Cases
by
Russell Dill
A Thesis Presented in Partial Fulfillment
of the Requirements for the Degree
Master of Science
Approved April 2015 by the
Graduate Supervisory Committee:
Aviral Shrivastava, Chair
Hyunok Oh
Arunabha Sen
ARIZONA STATE UNIVERSITY
May 2015
©2015 Russell Dill
All Rights Reserved
ABSTRACT
Error correcting systems have put increasing demands on system designers, both due
to increasing error correcting requirements and higher throughput targets. These
requirements have led to greater silicon area and power consumption and have forced system
designers to make trade-offs in Error Correcting Code (ECC) functionality. Solutions that
increase the efficiency of ECC systems are very important to system designers and have
become a heavily researched area.
Many such systems incorporate the Bose-Chaudhuri-Hocquenghem (BCH) method of
error correcting in a multi-channel configuration. BCH is a commonly used code because of
its configurability, low storage overhead, and low decoding requirements when compared to
other codes. Multi-channel configurations are popular with system designers because they
offer a straightforward way to increase bandwidth. The ECC hardware is duplicated for each
channel and the throughput increases linearly with the number of channels. The
combination of these two technologies provides a configurable and high throughput ECC
architecture.
This research proposes a new method to optimize a BCH error correction decoder in
multi-channel configurations. In this thesis, I examine how error frequency affects the
utilization of BCH hardware. Rather than implement each decoder as a single pipeline of
independent decoding stages, the channels are considered together and served by a pool of
decoding stages. Modified hardware blocks for handling common cases are included and
the pool is sized based on an acceptable, but negligible, decrease in performance.
This thesis’s experimental approach examines multi-channel configurations found in
typical NAND flash systems. My experimental data shows that the proposed pooled group
approach requires significantly fewer hardware blocks than a traditional multi-channel
configuration. By allowing a 2% performance degradation and sizing the decoding pool
appropriately, the scheme reduces hardware area by 47%–71% and dynamic power by 44%–59%.
Additionally, I examined what improvements were possible with the improved design
using the same hardware area as the traditional implementation. My experiments show that
an improved throughput of 3x–5x can be achieved or NAND flash lifetime can be extended
by 1.4x–4.5x.
DEDICATION
This paper is dedicated to my loving wife who has had both eternal patience with my
own commitments as well as the energy to deal with her own struggles.
ACKNOWLEDGEMENTS
The road to completing a thesis is long, bumpy, often confusing, and yet also exciting,
enriching, and rewarding. I could not have travelled this road alone and owe much of my
success to those who have helped me along the way. I’d like to thank those who have helped
me get to where I am today.
On such a journey, it is invaluable to have an excellent guide. Without such a guide, I
would have meandered more than I did and I certainly would have never completed my re-
search. I have great gratitude for my academic advisor and committee chair, Dr. Shrivastava.
Dr. Shrivastava has provided invaluable input into both my research and my writing.
I’d also like to thank Dr. Oh who has been able to provide valuable insight and advice.
He has proven an invaluable resource in my area of research and my thesis would be much
poorer without his help.
Finally, I’d like to thank the entire graduate advising department, who have had
to endure my countless questions, forms, requests, and overrides. Christina Sebring, Cyn-
thia Donahue, and Martha Vander Berg have not only ensured that I met the necessary re-
quirements but pushed when necessary.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
2 BACKGROUND
2.1 Error Rates
2.2 Flash Memory Lifetime
2.3 Types of ECC Schemes
2.4 Reed-Solomon Codes
2.5 Convolution Codes
2.6 Turbo Codes
2.7 LDPC Codes
2.8 BCH Codes
2.8.1 Finite Field Overview
2.8.2 Finite Field Operations Utilizing LFSR
2.8.3 Encoding
2.8.4 Decoding
2.8.4.1 Syndrome Computation
2.8.4.2 Error Locator Polynomial Generation
2.8.4.3 Root Finding
3 RELATED WORKS
3.1 Improving Throughput
3.2 Improving Efficiency
4 MAIN OBSERVATIONS
5 MY APPROACH
5.1 Architecture
5.1.1 Syndromes
5.1.2 Syndrome/Error Locator Polynomial Interconnect
5.1.3 Error Locator Polynomial Generator
5.1.4 Error Locator Polynomial/Root Solver Interconnect
5.1.5 Traditional Chien Root Solver
5.1.6 Reduced Root Solver
5.1.7 Output Units
5.2 Determining the Number of Units
6 EXPERIMENTS
6.1 Setup
6.2 Baseline Configuration
6.3 Area Optimized BCH Decoder
6.4 Throughput Optimized BCH Decoder
6.5 Flash Lifetime Optimized Design
7 CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF TABLES
Table
1 x^3 + x + 1 over GF(2^3)
2 Targeted ECC Range
3 Hardware Units Required for Area Optimized Decoder
4 Hardware Units Required for Lifetime Optimized Design
5 BER Achievable with Lifetime Optimized Design
LIST OF FIGURES
Figure
1 Basic BCH Decoder Structure
2 P/E Cycles, BER, and ECC Strength Relation
3 BCH Codeword Structure
4 Example LFSR
5 LFSR with Input
6 BCH Decoding Process
7 Probabilities of Errors at BER of 1e-4
8 An Example of the Proposed BCH Decoder
9 Probability that More than m Blocks Contain at Least One Error Where n = 8
10 Probability that More than m Blocks Contain More than One Error Where n = 8
11 Units Required for BER of 2e-4
12 Area Saving Results
13 Power Saving Results
14 Requirements of 2e-5 Design
15 Throughput Optimization Results
16 Improved Lifetime
Chapter 1
INTRODUCTION
Error rates in storage and communication channels are increasing (Luyi, Jinyi, and
Xiaohua 2012). Forward Error Correction (FEC) is a commonly used method to decrease the
error rates of those channels (Rate 1983). FEC adds redundant information to the message to
allow the receiver to correct errors. BCH codes are very commonly used across a wide range
of systems (Sun, Rose, and Zhang 2006). Some of the systems that utilize BCH error
correction are wireless communication links, NAND flash storage, magnetic storage, on-chip
cache memories, DRAM memory arrays, and data buses.
Although encoding BCH is fairly straightforward, performing the decoding steps is
much more complex (Zambelli et al. 2012). System designers must balance the high
complexity of BCH decoders with their overall system requirements (Strukov 2006). The decoders
must provide high throughput, either by running at high clock speeds or by implementing
bit-parallel operation. The maximum clock speed of the decoder is limited by the process
technology and the complexity of the decoder. Additionally, adding bit-parallel operation
increases the area of the decoder and makes it more difficult to achieve high clock speeds.
Limited available area for the decoder can also limit the number of errors that can be
corrected.
By developing a more area efficient BCH decoder, several possibilities open up besides
simply reducing area. The area savings can be used to add bit-parallel operation to improve
throughput. Alternatively, the decoder could be designed to correct more errors, extending
the useful life of flash memory or increasing the bit-rate of a communication channel.
[Figure 1 diagram: three pipeline blocks labeled S (Syndrome Vectors), Σ (Error Locator Equation), and C (Chien Search)]
Figure 1. Basic BCH decoder structure
A typical BCH decoder implementation is essentially a 3-stage pipeline as shown in
figure 1. The three stages of the pipeline are syndrome calculation, generating the error locator
polynomial, and finding the roots of the error locator polynomial (Hong and Vetterli 1995).
Each pipeline stage operates simultaneously and independently. Data is passed between the
stages when the current stage is complete and the next stage is ready to receive the data. This
pipelined configuration allows the decoder to operate on three codewords simultaneously.
The first stage, syndrome calculation, is similar in fashion to encoding and has a similar cost.
A simple logic circuit known as a Linear Feedback Shift Register (LFSR) is typically used for
syndrome calculation. As LFSRs are used in both encoding and syndrome calculation, work has
gone into optimizing high speed bit-parallel LFSR operation for BCH.
Calculating the error locator polynomial is performed by successive approximation us-
ing the Berlekamp-Massey algorithm. The implementation of the algorithm requires many
multipliers and dividers, and consumes a large portion of the decoder. General work into
optimized Berlekamp-Massey implementations has been done as well as the sharing of
Berlekamp-Massey units between BCH channels.
Solving for the roots of the error locator polynomial is typically performed by brute force
using an algorithm known as a Chien search (Litwin 2001). This algorithm checks for a root
at each possible value of x. The Chien search can be expanded to a bit-parallel architecture.
Optimization of this algorithm has been researched heavily, especially in the bit parallel case
due to the large area requirements.
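The brute-force idea behind the root search can be sketched in a few lines of Python. This is a minimal illustration only: the small field GF(2^3) with reduction polynomial x^3 + x + 1, the integer encoding of field elements, and the names `gf_mul`, `poly_eval`, and `brute_force_roots` are assumptions made for the example. A hardware Chien search evaluates the polynomial incrementally at successive powers of a field element rather than with Horner's rule as done here.

```python
REDUCTION_POLY = 0b1011  # x^3 + x + 1, assumed small example field GF(2^3)
FIELD_BITS = 3


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^3) elements held as 3-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a              # addition in GF(2^n) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << FIELD_BITS):    # reduce when the degree reaches 3
            a ^= REDUCTION_POLY
    return result


def poly_eval(coeffs: list[int], x: int) -> int:
    """Evaluate a polynomial at x; coeffs[i] is the coefficient of x^i."""
    result = 0
    for c in reversed(coeffs):       # Horner's rule
        result = gf_mul(result, x) ^ c
    return result


def brute_force_roots(coeffs: list[int]) -> list[int]:
    """Test every nonzero field element and keep those that are roots."""
    return [x for x in range(1, 1 << FIELD_BITS) if poly_eval(coeffs, x) == 0]


if __name__ == "__main__":
    # (x + 2)(x + 3) expands to x^2 + x + 6 in this field
    print(brute_force_roots([6, 1, 1]))
```

The cost of this search grows with the number of field elements rather than with the number of errors, which is why the single-error case is attractive to handle separately.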
Previous works have concentrated on optimizing the stages of single-channel decoders.
Much progress has been made on improving the performance and efficiency of individual
stages of the BCH decoding process. Although syndrome calculation is the simplest step,
it has still received much attention as similar hardware is also used for BCH encoding. As
performing operations in a bit-parallel manner can be used to improve performance, Jun
et al. (2005) have presented work on improving LFSR performance. Additionally,
Lee, Yoo, and Park (2012) have presented work on improving the syndrome calculation
techniques. Generating the error locator polynomial is the most algorithmically complex step
of BCH decoding. Compounding the issue, it cannot be modified for bit-parallel operation
to improve throughput. Jamro (1997) has demonstrated a method of preloading the initial
two steps of the algorithm as well as utilizing basis rearrangement to combine two serial steps
into one. The final stage of the algorithm is root finding, typically implemented by the Chien
search. Kristian et al. (2010) have demonstrated the straightforward step to convert the Chien
search from a purely serial operation to a bit-parallel operation. As moving to bit-parallel
operation quickly increases hardware area, Chen and Parhi (2004) have developed a group
matching scheme to reduce the hardware complexity in the bit-parallel case.
In order to achieve further advances in BCH decoding, I examine the decoding process
as a whole and specifically as implemented in multi-channel architectures. A multi-channel
BCH decoder is typically designed by putting several single-channel BCH decoders together
in parallel. For each set of decoded blocks, only a small fraction of the full error
correcting capability is used. For instance, if no error is present in a block, which can be detected
during the syndrome calculation, no additional stages are required. If one error is present
in a block, the error locator polynomial can be solved directly rather than through a brute
force search. For a wide range of error rates, these two cases are very common. My idea
then is to optimize a multi-channel architecture for the common case, rather than the worst
case. I use these observations along with the reduced root solver to optimize the stages of
the BCH decoder pipeline so that the area requirements are greatly reduced while the opti-
mization incurs a negligible performance degradation. The proposed optimizations reduce
power consumption and area requirements greatly. Additionally, by trading saved area for
greater complexity, we can improve throughput and error correcting capability as well.
In this thesis, I examine a fixed architecture decoder configured for a representative range
of error correction capability. The base configuration chosen for the decoder is 8 channels,
each 4 bits wide running at 200 MHz. This provides a total throughput of
8 × 4 bits × 200 MHz = 6.4 Gbit/s. Experiments cover decoding strengths of 5 bits, 7 bits,
8 bits, and 10 bits. This covers a typical
range of error rates. For the design parameters examined in this research, I achieve an area
savings of 47%–71% if I allow a 2% performance degradation. For my test platform, this
translates to a dynamic power savings between 44% and 62%.
Rather than reducing the area of the optimized design, I can keep the area the same and
instead improve performance. My technique increases throughput by 3x–5x with the same
area. Also, the improvements can increase the error correcting capability of the decoder with
the same area, which increases the usable life of flash memory. The ageing of flash memory is
determined by the number of Program/Erase (P/E) cycles each block has undergone. As the
number of P/E cycles increases, the error rate also increases. There is a threshold then where
the number of P/E cycles and associated error rate exceeds the error correction capability of
the BCH decoder. Although the raw error rate increases rapidly as flash memory ages, the
optimized decoder can improve flash lifetime by 1.4x–4.5x.
Chapter 2
BACKGROUND
2.1 Error Rates
The key component to understanding FEC and the improvements in this research is un-
derstanding error rates. Information theory tells us that coding systems exist that allow us
to use noisy communication channels reliably. From the central result of Claude Shannon’s
information theory (Shannon 1948):
Let a discrete channel have the capacity $C$ and a discrete source the entropy per
second $H$. If $H \le C$ there exists a coding system such that the output of the
source can be transmitted over the channel with an arbitrarily small frequency
of errors.
Typical FECs transform the input data by adding specially calculated redundant check
bits to form a codeword. The appropriate code must be selected for a number of bits to be
corrected and a chosen block size. Larger block sizes have lower storage overhead, but higher
algorithmic complexity.
If the number of errors that occur within the codeword exceeds the capability of the
chosen code, an uncorrectable error occurs. The probability of an uncorrectable error
occurring within a codeword determines the new channel error rate. This rate is calculated by
determining the probability that t or fewer errors will occur in a block (where t is the number
of errors that can be corrected by the code) and then working backwards to obtain the new
bit error rate of the channel. This calculation also accounts for the coding loss, the additional
probability that an error will occur in the redundant bits of the codeword.
In order to perform these calculations, the necessary values are the Bit Error Rate (BER),
p, the number of bits in the codeword, n, the error correcting capability of the code, t, and
the desired uncorrectable BER. The most basic calculation is determining the probability that
an error-free message is received. This is the case if every bit in the message is correct
(Houghton 2001, p. 168). I will represent this probability with $P_0(n)$.
$$ P_0(n) = (1 - p)^n \tag{2.1} $$
It is straightforward to calculate from eq. 2.1 the probability that at least one error has
occurred, $\neg P_0(n)$.
$$ \neg P_0(n) = 1 - P_0(n) \tag{2.2} $$
$$ \neg P_0(n) = 1 - (1 - p)^n \tag{2.3} $$
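As a quick numerical illustration of eqs. 2.1 and 2.3, the following Python sketch evaluates both probabilities. The function names and the example values of p and n are assumptions chosen for illustration and are not taken from the experiments in this thesis.

```python
import math


def p_error_free(p: float, n: int) -> float:
    """Eq. 2.1: probability that all n bits are correct at bit error rate p."""
    # log1p keeps precision when p is very small
    return math.exp(n * math.log1p(-p))


def p_at_least_one_error(p: float, n: int) -> float:
    """Eq. 2.3: probability that at least one of the n bits is in error."""
    # -expm1(x) computes 1 - exp(x) without cancellation for tiny results
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    p, n = 1e-4, 4096  # illustrative values only
    print(f"P0(n)     = {p_error_free(p, n):.4f}")
    print(f"not P0(n) = {p_at_least_one_error(p, n):.4f}")
```

The logarithmic form is used because evaluating 1 - (1 - p)^n directly loses precision when p is small, which is the regime of interest for ECC design.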
Moving on from this, one can calculate the probability that exactly $m$ errors occur in a
message, $P_{eq}(m, n)$.
$$ P_{eq}(m, n) = p^m (1 - p)^{n - m} \binom{n}{m} \tag{2.4} $$
By summing eq. 2.4 for various values of $m$, one can calculate the probability that $m$ or
fewer errors occur, $P_{le}(m, n)$:
$$ P_{le}(m, n) = \sum_{k=0}^{m} P_{eq}(k, n) \tag{2.5} $$
$$ P_{le}(m, n) = \sum_{k=0}^{m} p^k (1 - p)^{n - k} \binom{n}{k} \tag{2.6} $$
One can then use eq. 2.6 to find the probability that more than $m$ errors occur, $P_{gt}(m, n)$.
$$ P_{gt}(m, n) = 1 - P_{le}(m, n) \tag{2.7} $$
$$ P_{gt}(m, n) = 1 - \sum_{k=0}^{m} p^k (1 - p)^{n - k} \binom{n}{k} \tag{2.8} $$
Eq. 2.8 is important in selecting a BCH code as it shows the probability that a block
contains an uncorrectable error. One can then work backwards to find the uncorrectable
error rate: the probability that a block contains no uncorrectable error, $1 - P_{gt}(t, n)$,
takes the place of $P_0(n)$ in eq. 2.1, which is then solved for $p$.
$$ p(t, n)_{uncorr} = 1 - \left(1 - P_{gt}(t, n)\right)^{1/n} \tag{2.9} $$
Thus given a BER, p, a block size n, and a designed uncorrectable error rate, a sufficient
t can be found.
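The search for a sufficient t can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the per-bit uncorrectable rate is taken as 1 - (1 - P_gt)^(1/n), that is, eq. 2.1 inverted with 1 - P_gt as the probability of a block free of uncorrectable errors, and the tail of eq. 2.8 is summed directly in the log domain because subtracting P_le from 1 loses all precision at the very small probabilities involved.

```python
import math


def log_p_eq(m: int, n: int, p: float) -> float:
    """Natural log of eq. 2.4, the probability of exactly m errors in n bits."""
    log_binom = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
    return log_binom + m * math.log(p) + (n - m) * math.log1p(-p)


def p_gt(m: int, n: int, p: float) -> float:
    """Eq. 2.8, computed as the tail sum over k = m+1 .. n."""
    return math.fsum(math.exp(log_p_eq(k, n, p)) for k in range(m + 1, n + 1))


def uncorrectable_ber(t: int, n: int, p: float) -> float:
    """Per-bit rate equivalent to the block failure probability (assumed form)."""
    block_fail = p_gt(t, n, p)
    if block_fail >= 1.0:
        return 1.0
    return -math.expm1(math.log1p(-block_fail) / n)


def sufficient_t(n: int, p: float, target: float) -> int:
    """Smallest t whose uncorrectable bit error rate meets the target."""
    for t in range(n + 1):
        if uncorrectable_ber(t, n, p) <= target:
            return t
    raise ValueError("no t satisfies the target")  # not reached: t = n always passes


if __name__ == "__main__":
    # Illustrative inputs only; n counts every bit of the codeword.
    print(sufficient_t(n=4096, p=1e-4, target=1e-15))
```

Note that n here is held fixed for simplicity. In a real design the codeword grows with t because each additional correctable bit adds check bits, so the search would update n as t increases.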
2.2 Flash Memory Lifetime
The push to maximize the storage capacity of NAND flash memory has led to a storage
medium that requires extensive error correction in order to be reliable. The primary causes
of increasing error rates in flash memory are a decreasing process size and an increase
in the number of bits stored per cell. Both of these techniques are able to increase storage
space well beyond the additional overhead required by ECC.
The properties that lead to high storage densities within flash memory also lead to a lower
lifetime. The wearing out of flash memory cells is caused by the high voltages incurred dur-
ing P/E cycles. These high voltages lead to a deterioration of the tunnel oxide within the
cell which then allows leakage. Smaller process geometries have a smaller tunnel oxide layer
which wears faster. The smaller process geometries leave less margin for damage that occurs
to the cell.
The lifetime of flash memory is rated by the number of P/E cycles it is intended to endure
before being retired. Typical P/E lifetimes are rated in thousands of cycles. The targeted
lifetime in P/E cycles is chosen as a compromise between durability and ECC requirements.
However, by reducing the area and power required by BCH decoding substantially, that
compromise can be shifted and the lifetime of the flash memory extended.
The data collected by Cai et al. (2012) shows that the relation between P/E cycles and
error rates generally follows a polynomial growth. The BER for 3x-nm technology Multi-
level Cell (MLC) NAND flash examined in their research closely follows the relation:
$$ BER = A \cdot age^2 \tag{2.10} $$
where $A$ is a constant specific to a given flash memory. Rearranging the equation to
show the relation between age and BER eliminates the constant and gives the following
relation:
$$ \frac{BER_2}{BER_1} = \left( \frac{age_2}{age_1} \right)^2 \tag{2.11} $$
Thus a doubling of the P/E cycles leads to a quadrupling of the BER. Figure 2 shows
the relation between P/E cycles, the BER, and the strength of the BCH code required (Cai
et al. 2012).
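The quadratic relation of eq. 2.11 can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes a flash memory that follows eq. 2.10 exactly, which is only an approximation of the measured data.

```python
import math


def ber_ratio(age_ratio: float) -> float:
    """Eq. 2.11: factor by which the BER grows when P/E cycles grow by age_ratio."""
    return age_ratio ** 2


def lifetime_ratio(tolerable_ber_ratio: float) -> float:
    """Eq. 2.11 solved for age: lifetime gain when the tolerable BER grows by a factor."""
    return math.sqrt(tolerable_ber_ratio)


if __name__ == "__main__":
    print(ber_ratio(2.0))       # doubling the P/E cycles
    print(lifetime_ratio(4.0))  # tolerating four times the BER
```

Read in the second direction, the relation shows why lifetime gains are harder to obtain than BER gains: under this model a decoder must tolerate four times the raw BER to double the number of usable P/E cycles.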
[Figure 2 plot: x-axis P/E cycles (3k to 24k); left y-axis BER (10^-4 to 10^-2); right y-axis ECC strength in bits (0 to 200); series: BER and ECC strength]
Figure 2. P/E cycles, BER, and ECC strength relation (Cai et al. 2012)
The amount of ECC strength required is calculated by using a block size of 4096 bits
and a targeted uncorrectable bit error rate of $1 \times 10^{-15}$. However, the number of bits of ECC
overhead scales at a much faster rate.
2.3 Types of ECC Schemes
Due to a wide range of often conflicting requirements, a wide range of ECC schemes have
been developed. Although codes vary in many different ways, they fall under two primary
categories. These primary categories are block based codes and convolution codes
(Morelos-Zaragoza 2006).
Block based codes, as the name implies, operate on blocks. The encoder accepts a fixed
sized input and adds redundant bits to provide a fixed sized output block. A useful property
of block based codes is that each block can be encoded and decoded independently. This is in
contrast to convolution codes. Convolution codes operate on a stream of data using a sliding
window. This typically means that the length of encoded data is variable with termination on
both ends. Convolution codes tend to be more efficient than block codes and some approach
the Shannon limit.
Maximum Distance Separable (MDS) codes are an interesting subset of ECC schemes.
An MDS code provides the greatest possible error correcting capability for a given message
size and codeword size (Puttarak 2011). MDS codes are excellent erasure codes and see wide
use in storage systems. When used in storage systems, portions of the codeword are spread
across multiple storage volumes. This allows for a certain number of volumes to be lost and
still maintain data integrity. While MDS codes offer a number of advantages in efficiency,
they also suffer from a number of limitations. Only a small set of MDS codes exists, and
they are not as configurable as other codes. Practical encoders and decoders also have
quadratic complexity.
Additional information can be fed to the decoder to improve the chances that the output
will be error free (Epstein 1958). The simplest form of information is erasure knowledge.
Locations known to contain invalid data (erasures) are fed to the decoder. Codes that can
use this extra information are known as erasure codes.
Beyond simple erasure information, some codes can accept probability information.
Soft-decision decoding uses the probability of a bit being a specific value when decoding
(Hagenauer and Hoeher 1989). This increases the complexity of not only the decoder, but also
the associated input hardware. The input hardware is modified to provide a set number of
probability levels rather than just one or zero. Decoders using soft-decision decoding typi-
cally use iterative belief based algorithms.
High complexity soft-decision decoders that operate close to the Shannon limit face
NP-complete decoding complexity (Han, Hartmann, and Chen 1993). Practical decoders are
implemented using what is known as suboptimal decoding. However, the suboptimal
decoding creates an effect known as the error floor (Garello et al. 2001). This is the result of
input that is decoded incorrectly due to the suboptimal decoding method. Predicting
troublesome inputs and the nature of the error floor requires long simulations (McGregor and
Milenkovic 2010).
Because of the wide range, strengths, and weaknesses of available codes, many systems
combine them together forming a concatenated code. Typically an outer code with good
erasure performance is chosen, along with an inner convolution code with good random
error correction. This allows the strength of the outer and inner code to be combined
(Justesen, Høholdt, and Thommesen 2004).
2.4 Reed-Solomon Codes
Reed-Solomon codes are non-binary block error correcting codes. Each block consists
of a set of symbols (Reed and Solomon 1960). They can be used as error correcting
codes, erasure codes, or a combination of both. Because bits are arranged in symbols, they are
best suited for applications where errors occur in bursts, as a burst is unlikely to affect more
than one or two symbols. However, a single-bit random error still destroys an entire symbol,
making the code a poor choice for channels with random errors. Reed-Solomon codes are
used in areas such as optical disks, QR codes, disk arrays, and digital video transmission.
2.5 Convolution Codes
Convolution codes include a wide range of codes defined by the input rate, output rate,
memory, and feedback polynomial (Forney Jr 1973). Although the codes are all part of the
same family, decoding strategies vary widely and greatly affect the usability of the code.
Convolution decoders typically implement soft-decision decoding and both optimal and
suboptimal algorithms are available depending on the complexity of the code. Because convolution
codes utilize a sliding window, they are best suited for systems that stream data. Convolution
codes are being superseded, but are still popular in satellite and mobile communications.
2.6 Turbo Codes
The complexity of the convolution code decoding process has given rise to a modified
class of convolution codes known as turbo codes. A turbo code is formed by using multiple
permutations of a convolution code. Decoders are typically capable of soft-decision
decoding and efficiency approaches the Shannon limit (Berrou and Glavieux 1996). Because of the
complexity involved, suboptimal decoders using a belief propagation algorithm are required
for real world implementations. This suboptimal decoding means that turbo codes are
affected by the error floor and are often wrapped within a hard-decision decoder. Turbo codes
are used in areas such as satellite communications and mobile networks (including the 3G
and 4G standards). Turbo codes perform best at low code rates.
2.7 LDPC Codes
Low-Density Parity-Check (LDPC) codes are a class of block based error correcting
codes (Gallager 1962). Like turbo codes, if used with a soft-decision decoder they are able
to perform very close to the Shannon limit (Richardson, Shokrollahi, and Urbanke 2001).
When used with a hard-decision decoder, LDPC codes give similar performance to BCH codes.
Being a block code, LDPC is usable in many applications where convolution codes are a
poor fit. Additionally, LDPC codes perform well at both high and low code rates. These
codes have the advantage of linear time decoding complexity while still offering performance
very close to the Shannon limit. However, like turbo codes, they require suboptimal belief
propagation based decoders and thus have issues with an error floor. Outer encodings such as
Reed-Solomon or BCH are typically used to correct the error floor effect. LDPC codes have
been gaining popularity in recent years and are used in areas such as digital video
transmission and high speed Ethernet, and are just starting to be used in NAND flash memories
(Marvell 2014).
2.8 BCH Codes
BCH is a block based error correction code, meaning that it operates on a block of bits at
a time (Bose and Ray-Chaudhuri 1960). It transforms the input data by adding specially
calculated redundant check bits to form a codeword. The appropriate code can be selected for
culated redundant check bits to form a codeword. The appropriate code can be selected for
a number of bits to be corrected and a chosen block size. Larger block sizes have lower stor-
age overhead, but higher algorithmic complexity. This gives BCH a number of advantages,
including:
• Configurability for number of bits to be corrected.
• Scales to different word sizes.
• Optimal algebraic method for decoding.
• No error floor.
• Original data embedded in codeword.
Each codeword within the code is constructed such that it is at least a minimum Hamming
distance away from any other codeword. The Hamming distance between two codewords is the
number of bits that must be changed within one valid codeword to transform it into the
other, and the minimum over all pairs of codewords is $d_{min}$. The number of bit errors that
can be detected is thus one less than the minimum Hamming distance. Figure 3 shows the
structure of a BCH codeword, including the message and the redundant ECC data that is added
to form the codeword.
Figure 3. BCH codeword structure (a codeword made up of the message followed by the ECC data)
The function of the decoder is to determine which valid codeword the received codeword
most closely resembles. If a codeword receives enough bit errors to cross half or more of
the Hamming distance between two codewords, it will be incorrectly decoded. Thus the
number of bit errors that can be corrected, t, is related to the minimum Hamming distance
by the following relation:
$d_{min} \geq 2t + 1$ (2.12)
The encoding and decoding of BCH codes are performed using finite fields. A short
overview of finite fields is necessary for understanding both the mechanism of BCH codes
and the proposed improvements.
2.8.1 Finite Field Overview
As the name implies, a finite field contains a finite number of elements. Within the set of
elements, operations are defined such as addition, subtraction, multiplication, and division.
All such operations on field elements result in another field element. Although a wide variety
of finite fields can be defined, the use of binary finite fields makes for a straightforward
implementation using digital systems.
A binary finite field is defined by its degree, n, and is denoted $GF(2^n)$. The elements of a
finite field are created by a generator polynomial. Each nonzero element in the field is a
successive power of x, reduced modulo the generator polynomial. Thus the index of the element
within the field, written as a power of x, is known as the power form. For example, for
$GF(2^3)$ with a generator polynomial of $x^3 + x + 1$, the field shown in table 1 is produced:
Table 1. $x^3 + x + 1$ over $GF(2^3)$
Power form    Polynomial form    Binary representation
0             0                  b000
x^0           1                  b001
x^1           x                  b010
x^2           x^2                b100
x^3           x + 1              b011
x^4           x^2 + x            b110
x^5           x^2 + x + 1        b111
x^6           x^2 + 1            b101
Finite field addition and subtraction are performed by adding or subtracting the polynomial
forms. Because the characteristic of the field is two (binary field), addition and subtraction
are equivalent. In either case, any two equal powers of x cancel out. For example, adding
$x^2$ and $x^2 + x + 1$ produces $x + 1$. This is the equivalent of the logical exclusive or
(XOR) operation.
Finite field multiplication is performed by multiplying the two polynomials together,
performing elimination of terms as described above, and then taking the result modulo the
generator polynomial. Finite field division is the inverse of finite field multiplication.
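As an illustration, the addition and multiplication rules above can be sketched in software for the $GF(2^3)$ field of table 1. The function names and the integer encoding of field elements (bit i holds the coefficient of x^i) are assumptions made for this sketch and are not part of the hardware design.

```python
# Sketch of GF(2^3) arithmetic with generator polynomial x^3 + x + 1.
# Elements are integers whose bit i is the coefficient of x^i
# (an encoding assumed for this example).

GENERATOR = 0b1011  # x^3 + x + 1
DEGREE = 3


def gf_add(a, b):
    """Addition and subtraction are both a bitwise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiply two elements and reduce modulo the generator polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # add the current shifted copy of a
        b >>= 1
        a <<= 1  # multiply a by x
        if a & (1 << DEGREE):
            a ^= GENERATOR  # reduce as soon as degree 3 is reached
    return result


def power_table():
    """Return [x^0, x^1, ..., x^6] in binary representation."""
    table, element = [], 1
    for _ in range((1 << DEGREE) - 1):
        table.append(element)
        element = gf_mul(element, 0b010)  # multiply by x
    return table


if __name__ == "__main__":
    print([format(e, "03b") for e in power_table()])
    print(format(gf_add(0b100, 0b111), "03b"))  # x^2 + (x^2 + x + 1)
```

Running the sketch lists the binary representations in the same order as the nonzero rows of table 1.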
When utilizing finite fields for BCH codes, the number of nonzero elements in the field is
equal to the number of bits within a codeword. For instance, $GF(2^8)$ contains 255 elements
(excluding 0). The associated BCH block size would be 255 bits.
In order to make BCH codes easier to work with, only a portion of the codeword is used
and the rest of the bits are set to zero. For instance, when using a block size of 16 bytes (128
bits), a BCH code with a block size of 255 bits would be selected. Throughout this thesis,
codewords are assumed to be constructed in this way.
2.8.2 Finite Field Operations Utilizing LFSR
LFSRs are commonly used for finite field operations. The basic operation of an LFSR
allows one to transform a finite field element to the next or previous element within the field.
This is equivalent to multiplying or dividing by $x^1$. Thus repeated operation can multiply
or divide by any power of x.
An LFSR consists of a set of registers interconnected in a ring configuration. Between
each register there can be an XOR gate. The XOR gate combines the value of the previous
register with feedback from the highest register. An example LFSR is shown in figure 4.
The configuration shown can be used to produce the finite field shown in table 1. This is
because the connections match the binary representation of the generator polynomial. In
this configuration, the LFSR will cycle through each element of the field in order.
Figure 4. Example LFSR (three registers D₂, D₁, and D₀ with one XOR gate)
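A behavioral sketch of such an LFSR is given here for the generator polynomial $x^3 + x + 1$. The mapping of state bits to registers (bit i models register D_i) and the function names are assumptions of this sketch; the actual wiring of figure 4 may order the registers differently.

```python
# Behavioral sketch of a three-bit LFSR for x^3 + x + 1.
# The state is an integer; bit i models register D_i (an assumed mapping).

FEEDBACK_TAPS = 0b011  # low-order terms of x^3 + x + 1, i.e. x + 1
WIDTH = 3


def lfsr_step(state):
    """Advance one clock: shift toward the high register and apply feedback."""
    feedback = (state >> (WIDTH - 1)) & 1  # bit leaving the highest register
    state = (state << 1) & ((1 << WIDTH) - 1)
    if feedback:
        state ^= FEEDBACK_TAPS
    return state


def lfsr_sequence(start=0b001):
    """Collect states until the LFSR returns to its starting value."""
    states, state = [], start
    while True:
        states.append(state)
        state = lfsr_step(state)
        if state == start:
            return states


if __name__ == "__main__":
    print([format(s, "03b") for s in lfsr_sequence()])
```

Each step is a multiplication by x, so starting from b001 the states visit the nonzero elements of table 1 in order before repeating.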
LFSRs are commonly used for BCH operations, either in their default form, or in a
slightly modified form that allows other operations, such as determining the quotient and
remainder of a division (Saluja 1987). Such an LFSR is shown in figure 5. The numerator is
fed into the input serially, and the XOR gates are chosen to represent the divisor.
Figure 5. LFSR with input (three registers D₂, D₁, and D₀ with two XOR gates)
2.8.3 Encoding
BCH encoding is performed by dividing the input data by a specially formed polynomial.
This is performed utilizing a modified LFSR that accepts a bit of input data per clock cycle.
At the end of the operation, the LFSR contains the remainder of the division, which forms the
redundant code bits (J.-H. Lee et al. 2013).
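The division-based encoding described above can be sketched with a toy code. The (7,4) code with generator polynomial $x^3 + x + 1$ used here is an assumption chosen to match table 1; it is not the code used in the experiments, and the function names are hypothetical.

```python
# Sketch of systematic encoding by polynomial division.
# Toy (7,4) code with generator x^3 + x + 1; all parameters are assumptions
# chosen for illustration, not the code used in the thesis.

GEN_LOW = 0b011  # generator polynomial without its leading x^3 term
PARITY_BITS = 3
MESSAGE_BITS = 4


def encode(message):
    """Return the 7-bit codeword: message bits followed by check bits."""
    reg = 0
    for i in reversed(range(MESSAGE_BITS)):  # one message bit per clock
        bit = (message >> i) & 1
        feedback = bit ^ ((reg >> (PARITY_BITS - 1)) & 1)
        reg = (reg << 1) & ((1 << PARITY_BITS) - 1)
        if feedback:
            reg ^= GEN_LOW
    return (message << PARITY_BITS) | reg


def remainder(word, generator=0b1011):
    """Plain polynomial remainder over GF(2), used only to check the result."""
    glen = generator.bit_length()
    while word.bit_length() >= glen:
        word ^= generator << (word.bit_length() - glen)
    return word


if __name__ == "__main__":
    for m in range(1 << MESSAGE_BITS):
        print(format(m, "04b"), format(encode(m), "07b"))
```

Every codeword produced this way leaves a zero remainder when divided by the generator polynomial, which is the property the decoder later relies on.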
2.8.4 Decoding
The decoding process is broken into three stages. The input codeword is passed into the
first stage and error locations are generated by the final stage. The stages operate indepen-
dently and thus the process can be pipelined with three codewords being decoded simulta-
neously. Figure 6 shows the hardware stages of the decoding process. In the figure, the red
squares within the codeword represent error locations.
Figure 6. BCH decoding process (the codeword enters the syndrome vectors stage, S, as serial
data; the syndromes pass to the error locator equation stage, Σ; the error locator polynomial
coefficients pass to the Chien search stage, C, which outputs the error locations as serial
data)
2.8.4.1 Syndrome Computation
The first stage, syndrome computation, accepts the input data. The syndromes are a set
of values that once computed, depend only on the error locations within the message, and
not on the message itself. The number of syndromes is twice the number of errors that the
BCH code can correct, t. This produces an underdetermined system, giving many possible
solutions for error locations. It is up to the next stage to solve for the most likely situation.
The syndromes are generated by dividing the codeword by a set of minimal polynomials,
producing a set of remainders. Because of relations between the minimal polynomials, many
syndrome elements can be easily derived from the other elements, reducing the amount of
computation required. A useful property of the syndromes is that if all calculated syndromes
are zero, then no errors exist in the received message.
Syndrome elements can be calculated by a modified LFSR or by repeated multiplication.
The most efficient method for the given syndrome should be chosen. Both methods operate
on one input bit at a time. This limits the overall bandwidth of the decoder to the clock
rate of the syndrome units. However, syndrome calculation can be modified to perform
bit-parallel operations, greatly increasing the throughput of the syndrome calculation stage at
the cost of increased area and power.
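The repeated multiplication method can be sketched as follows. The toy parameters ($GF(2^3)$ with $x^3 + x + 1$, a 7-bit block, t = 1) and all names are assumptions for illustration. The sketch evaluates the received word at powers of the field element x, which gives the same syndrome values as reducing the remainders described above.

```python
# Sketch of bit-serial syndrome calculation by repeated multiplication.
# Toy parameters (assumed): GF(2^3) with x^3 + x + 1, block of 7 bits, t = 1.

GENERATOR = 0b1011
ALPHA = 0b010  # the field element x
BLOCK_BITS = 7


def gf_mul(a, b):
    """Multiply in GF(2^3), reducing modulo the generator polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:
            a ^= GENERATOR
    return result


def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def syndrome(received, j):
    """Evaluate the received word at alpha^j, one input bit per step."""
    point = gf_pow(ALPHA, j)
    s = 0
    for i in reversed(range(BLOCK_BITS)):  # highest-order bit first
        s = gf_mul(s, point) ^ ((received >> i) & 1)
    return s


def syndromes(received, t=1):
    """Return [S_1, ..., S_2t]."""
    return [syndrome(received, j) for j in range(1, 2 * t + 1)]


if __name__ == "__main__":
    codeword = 0b0001011  # a valid codeword of the toy code
    print(syndromes(codeword))             # all zero: no errors
    print(syndromes(codeword ^ (1 << 5)))  # depends only on the error location
```

In this toy case the second syndrome is always the square of the first, an example of the relations that allow some syndromes to be derived from others rather than computed directly.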
2.8.4.2 Error Locator Polynomial Generation
The error locator polynomial is defined such that its roots give the locations of the er-
rors within the message. The number of roots, or degree, of the error locator polynomial
indicates the number of errors within the message. The second stage of the BCH decoding
process is to generate the error locator polynomial from the set of syndromes.
The Berlekamp-Massey algorithm was developed to generate the error locator polyno-
mial from a set of syndromes. It is an iterative algorithm which calculates a discrepancy at
each stage, refining the approximation. This process requires several finite field multiplica-
tions, divisions, and additions per cycle of the algorithm which contributes to the overall
complexity of the decoder.
One set of syndromes can produce multiple possible error locator polynomials, each with
a different degree. It is assumed by the algorithm that the most likely occurrence, the fewest
number of errors, indicates the most likely error locator polynomial. This highlights the fact
that if more errors occur than the code is configured to handle, the decoder may decode the
input data incorrectly.
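A software sketch of the Berlekamp-Massey iteration is given here for reference. It uses an assumed toy field, $GF(2^4)$ with $x^4 + x + 1$, and a textbook formulation of the algorithm; it does not model the multiplier structure of the hardware, and all names are hypothetical.

```python
# Sketch of the Berlekamp-Massey iteration over GF(2^4) (x^4 + x + 1).
# Field size, names, and list layout are assumptions for illustration.

GENERATOR = 0b10011
ALPHA = 0b0010  # the field element x


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= GENERATOR
    return result


def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    return gf_pow(a, 14)  # a^15 = 1 for every nonzero element


def syndromes_for(positions, t=2):
    """Build [S_1, ..., S_2t] for errors at the given bit positions."""
    result = []
    for j in range(1, 2 * t + 1):
        s = 0
        for p in positions:
            s ^= gf_pow(ALPHA, (p * j) % 15)
        result.append(s)
    return result


def berlekamp_massey(synd):
    """synd = [S_1, ..., S_2t]; returns locator coefficients, lowest degree first."""
    locator, previous = [1], [1]
    length, shift, prev_disc = 0, 1, 1
    for n, s in enumerate(synd):
        disc = s  # discrepancy between predicted and actual syndrome
        for i in range(1, length + 1):
            if i < len(locator):
                disc ^= gf_mul(locator[i], synd[n - i])
        if disc == 0:
            shift += 1
            continue
        scale = gf_mul(disc, gf_inv(prev_disc))
        update = [0] * shift + [gf_mul(scale, c) for c in previous]
        size = max(len(locator), len(update))
        padded = locator + [0] * (size - len(locator))
        update += [0] * (size - len(update))
        new_locator = [a ^ b for a, b in zip(padded, update)]
        if 2 * length <= n:  # the locator must grow in degree
            previous, prev_disc = locator, disc
            length, shift = n + 1 - length, 1
        else:
            shift += 1
        locator = new_locator
    return (locator + [0] * (length + 1))[: length + 1]


if __name__ == "__main__":
    print(berlekamp_massey(syndromes_for([3, 10])))
```

The length of the returned list minus one is the number of errors the algorithm has assumed, which is the smallest count consistent with the syndromes.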
2.8.4.3 Root Finding
To find error locations, the roots of the error locator polynomial must be found. Since the
degree of the polynomial can be as large as t, a brute force algorithm is used for hardware BCH
implementations. An optimized algorithm for this brute force search has been developed
and is known as a Chien search. To implement the Chien search, a set of registers is
loaded with the coefficients of the error locator polynomial. During each cycle of the Chien
search, each register is multiplied by $x^n$, where n is the degree of x associated with the given
coefficient. At the end of each cycle, all registers are summed. If the sum of all the registers
is zero, then a root has been located. The cycle number indicates the index within the block
of the error location.
The order of the Chien output can be made to match the order of the input message.
Thus the output of the BCH decoder is a set of locations within the message that must be
toggled to correct received errors.
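The register-based search can be sketched in software as follows. The field ($GF(2^4)$ with $x^4 + x + 1$), the 15-bit block, and in particular the mapping from cycle number to bit position are assumptions of this sketch; a hardware implementation may choose a different output order.

```python
# Sketch of a Chien search over GF(2^4) (x^4 + x + 1), block of 15 bits.
# The mapping from cycle number to bit position is an assumed convention.

GENERATOR = 0b10011
ALPHA = 0b0010  # the field element x
BLOCK_BITS = 15


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= GENERATOR
    return result


def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def chien_search(locator):
    """locator = [1, L_1, ..., L_v]; returns sorted error bit positions."""
    registers = list(locator)  # one register per coefficient
    steps = [gf_pow(ALPHA, j) for j in range(len(locator))]  # x^j for register j
    positions = []
    for cycle in range(BLOCK_BITS):
        total = 0
        for value in registers:
            total ^= value  # sum all registers
        if total == 0:
            # the root x^cycle is the inverse of x^position
            positions.append((BLOCK_BITS - cycle) % BLOCK_BITS)
        registers = [gf_mul(r, s) for r, s in zip(registers, steps)]
    return sorted(positions)


if __name__ == "__main__":
    x1, x2 = gf_pow(ALPHA, 3), gf_pow(ALPHA, 10)
    print(chien_search([1, x1 ^ x2, gf_mul(x1, x2)]))
```

The example builds the locator for errors at positions 3 and 10 directly from its roots, so the search is expected to return those two positions.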
As in syndrome computation, the Chien search operates on one bit per cycle and the
bandwidth is thus limited to the clock speed of the Chien unit. To improve bandwidth, multiple
Chien search steps must be performed each cycle. The most straightforward way of performing
this parallel operation is to duplicate the Chien search block for each bit of parallel output.
Each stage must skip ahead by k cycles, where k is the number of parallel outputs. While some
logic can be shared between the parallel units, the cost in area and power of parallelizing the
Chien search operation is high.
Chapter 3
RELATED WORKS
Optimizing BCH decoders has generally followed two sometimes complementary and
sometimes conflicting paths. These paths are to increase the throughput of the decoder and
to increase the efficiency of the decoder. Here we examine the current state of the art and
related research in those two areas.
3.1 Improving Throughput
Although increasing clock rate leads directly to an increase in throughput, there is a limit
due to the complexity involved in the decoder. There are two other methods of increasing
the throughput: implementing bit parallel operation in the syndrome calculation and root
finding stages, and implementing multiple BCH decoders operating in parallel within a system.
Bit parallel operation is a straightforward implementation and typically requires few
modifications to an overall system to implement. However, as bit parallel operation increases
the complexity of the decoder, it decreases the achievable clock rate and thus has limits.
Additionally, bit parallel operation cannot be applied to generating the error locator polynomial,
and thus the overall throughput of the system will come to be limited by this step.
Implementing multiple BCH channels bypasses these problems as it is simply a dupli-
cation of the BCH engine. Multiple channels require modification of the overall system to
implement and can be made in two primary situations.
The first is the case of a multi-channel architecture, for example a system that has multiple
data channels connected to flash memory (Abraham et al. 2010).
The second is to interleave the BCH code. Interleaving not only leads to increased
throughput, but also offers error correction advantages in certain types of channels (K. Lee
et al. 2010). This is because in many types of channels, errors tend to occur in bursts. With
interleaved operation the burst is broken up across many codewords, decreasing the proba-
bility that a single burst will overwhelm the error capability of the chosen BCH code (Shi
et al. 2004).
Both methods of multi-channel operation scale each property of the system (throughput,
area, power) in a purely linear fashion.
3.2 Improving Efficiency
Improving the efficiency of each stage of decoding can lead to lower area requirements,
lower power consumption, and increased clock speeds leading to higher throughput. As
such, many ideas have been put forth to improve the efficiency of each BCH decoding stage.
For instance, it has been shown that a relation exists between many of the syndromes (Lin
and Costello 1983, p152). This makes it possible to only calculate a limited set of syndromes,
and then apply the relations to expand them into the full set of syndromes. This decreases
the overall area and power requirements of the decoder.
Additionally, it has been shown that there are multiple methods of finding each syndrome
element (p165). For a given element, it can be shown which method is the most efficient.
This information can then be used to calculate each syndrome in the most efficient
way possible. This not only decreases the overall area and power requirements of the decoder,
but because it decreases complexity, can also increase clock speeds and throughput.
Work has also gone into decreasing the complexity of bit-parallel LFSRs. This work can
be and has been applied to bit-parallel syndrome calculation.
As the step of generating the error locator polynomial can limit the overall throughput
of the decoder, improving its efficiency, increasing the achievable clock rate, and decreasing
the overall number of clock cycles required is important. General optimizations to finite
field operations, such as more efficient multipliers and dividers, can be applied to generating
the error locator polynomial.
Jamro (1997) has shown how linking multipliers which operate on different bases can
lead to a reduction in the number of clock cycles required. This is done by linking a serial
multiplier that takes parallel input and produces serial output with a multiplier that takes
serial input and produces parallel output. Because these two multipliers operate on
different bases, Jamro also presents an efficient basis conversion circuit linking them.
Additionally, Jamro shows how the first two rounds of the algorithm can be skipped by
precalculating the necessary state of the registers. Both of these optimizations reduce the latency
of generating the error locator polynomial. Reducing the latency allows the decoder
to run at a higher overall throughput.
The Chien search requires a number of multipliers equal to the number of coefficients
in the error locator polynomial (Chen and Parhi 2004). Additionally, bit parallel operation
requires a duplication of this set of multipliers for each output bit as well as a multiplier to
load each coefficient with the appropriate value.
Because of this high cost in complexity and area, two complementary methods have been
put forth for improvement. The first is to combine the multiple parallel Chien operations
together rather than considering them separately. Several multipliers are linked together
serially, and the intermediate stages are summed for each output bit. While this decreases
complexity, it greatly increases the critical path of the unit, decreasing possible clock rates. The
second is a complementary group matching scheme that has been applied to this structure to
reduce complexity and the critical path. The scheme exploits the substructure sharing within
a multiplier and among groups of multipliers (Chen and Parhi 2004).
Chapter 4
MAIN OBSERVATIONS
In order to push the uncorrectable error rate very low, BCH decoders are heavily oversized
compared to the number of errors they typically correct. The common case is for only a fraction
of the decoder to be used. This is shown clearly in figure 7.
Figure 7. Probabilities of errors at BER of $1 \times 10^{-4}$ (probability, 0 to 0.6, plotted
against the number of errors in a single block, 0 to 10)
At an error rate of $1 \times 10^{-4}$, the decoder is required to correct up to 10 bit errors in
order to push the uncorrectable bit error rate below $1 \times 10^{-15}$. However, the probability
that any errors occur in a block of 4096 bits is roughly one in three. This means that in a
multi-channel decoder, on average only about a third of the decoding hardware is required. Moving
beyond that, the probability that the entire error correcting capability of a single decoder
will be required is exceedingly small, around 1 in 30 billion.
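These probabilities can be approximated with a short calculation. The sketch assumes independent bit errors (a binomial model) and, as a simplification, counts only the 4096 data bits and ignores the ECC bits, so its output is indicative rather than an exact reproduction of figure 7. The function names are hypothetical.

```python
# Sketch: distribution of bit errors in one block, assuming independent
# bit errors (binomial model). Block size and BER are taken from the text;
# ignoring the ECC bits in the block length is a simplification.

from math import comb


def prob_errors(k, block_bits=4096, ber=1e-4):
    """Probability of exactly k bit errors in a block."""
    return comb(block_bits, k) * ber**k * (1 - ber) ** (block_bits - k)


def prob_any_error(block_bits=4096, ber=1e-4):
    """Probability that a block contains at least one bit error."""
    return 1 - prob_errors(0, block_bits, ber)


if __name__ == "__main__":
    print(f"any error:  {prob_any_error():.3f}")
    for k in range(11):
        print(f"{k:2d} errors: {prob_errors(k):.3e}")
```

Under these assumptions about two thirds of blocks are error free, and the probability of exactly 10 errors is on the order of $10^{-11}$.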
This observation alone does not allow any improvement because at any time the full
decoder may be required. I instead observe that on average only a small percentage of the
decoder is required and then apply that observation to a multi-channel decoder. In such a
decoder, at least one full BCH decoder must always be included. The remainder of the decoding
hardware can consist of reduced decoders of some kind.
These reduced decoders can reduce overall hardware requirements greatly.
To route data to the correct decoding block, the number of errors contained within a
block must be considered. The result of the syndrome calculation can be used to determine
if a block has any errors. All blocks must then at a minimum be passed through the syn-
drome calculation block. If the syndromes do evaluate to zero, then no further processing is
necessary for that block.
To calculate the number of errors beyond zero, the error locator polynomial must be
solved. Any reduction in the complexity of the decoder beyond zero errors must then be
in the root search. The case of only one error is a very common case and a good target to
optimize for. The optimization here is fairly straightforward as the error locator polynomial
will only be of degree one in this case. Rather than a brute force search, the root can be found
algebraically.
The trade-off with such a system is that there is a possibility that insufficient resources
will be available to decode a certain set of blocks. If this occurs, decoding will be delayed
until resources are available and performance will be degraded. Fortunately, it is fairly
straightforward to calculate this performance drop and thus intelligently trade off a small drop in
performance for a large reduction in hardware requirements.
Chapter 5
MY APPROACH
This chapter reviews my methods of acting on my observations. I first lay out the design
of the decoder architecture. The decoder architecture is designed as pools of hardware blocks.
This allows the pools to be sized appropriately and data to be assigned to units in each pool
as they become ready. The design of a reduced root solver for blocks with only one error is
also shown. Second, I show how the correct number of units can be chosen in order to meet
a target miss rate.
5.1 Architecture
The basic design of a BCH decoder is broken down into three pipeline stages. For my
multi-channel architecture, I implement those stages as stations fed by round robin arbi-
trators. The arbitrator collects data from each stage and then passes it to the next. The
general layout of the decoder is shown in figure 8. In the example configuration, there are 3
error locator polynomial generator units (Σ), one traditional Chien solver (C), and two reduced
root solvers (r).
The overall architecture can be configured with the following compile time parameters:
• Number of channels.
• Number of error locator polynomial generators.
• Number of traditional Chien search units.
• Number of reduced root solver units.
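Figure 8. An example of the proposed BCH decoder (eight syndrome units, S₀ to S₇, feed an
arbitrator serving three error locator polynomial generators, Σ₀ to Σ₂; a second arbitrator
serves one Chien solver, C₀, and two reduced root solvers, r₁ and r₂; each channel's input and
output is labeled 4)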
The parameters must be chosen based on the allowed miss rate.
5.1.1 Syndromes
For every channel, the syndromes must be computed. This means that the number of
syndrome units will be equal to the number of channels. I fix each syndrome unit to a chan-
nel and each unit contains a bit counter. The counter will be used to track how many bits
the unit has received and if the syndrome is ready.
On the input side, the syndrome unit contains two control signals: an input to indicate
that it should start accepting syndrome data, and an output that acknowledges that signal.
If the unit is busy or contains processed syndrome data, it will not acknowledge the start
signal.
On the output side, the syndrome unit contains an additional two control signals. One
signal indicates that the syndrome unit contains processed syndrome data. The other control
signal is an input that clears this state and allows the unit to accept new data.
Each unit can be configured with the following compile time parameters:
• Bit width.
• Code block size and number of correctable errors.
• Additional pipeline stages to meet timing.
• Additional register duplication to meet timing.
5.1.2 Syndrome/Error Locator Polynomial Interconnect
This interconnect passes data from the channel syndrome units to the pool of error loca-
tor polynomial generators. The unit primarily consists of a register to hold the syndromes,
an index to the current syndrome input unit, and an index to the current error locator
polynomial unit. Both indexes operate in a purely round robin fashion. The unit also contains
circuitry to check its currently stored syndrome against zero. It determines if it is necessary
to pass the syndrome data to the error locator polynomial unit or if it can be skipped.
The general operation is to wait on the currently indexed syndrome unit. When a syn-
drome is ready, it accepts the syndrome and stores it in its syndrome register. It also stores
the index to associate the data with a channel. It then waits for the syndrome to be com-
pared against zero. If the check indicates no errors are present, it sets a flag indicating that
the current channel output should skip root finding for the next data set.
If the check indicates errors are present, it waits for the next error locator polynomial
generator unit to become ready. When ready, it passes its syndromes to that unit and sets
the start bit for that unit. It also passes the currently stored channel number so that the
error locator polynomial will be associated with the correct channel.
5.1.3 Error Locator Polynomial Generator
If any error exists within the codeword, the error locator polynomial must be found. The
control signals on this unit are similar to the control signals on the syndrome unit: a start
and start acknowledge signal on the input, and a signal to indicate the done state and a signal
to clear the done state on the output.
The output of the error locator polynomial generator unit includes the error locator
polynomial and also the number of errors detected within the codeword. The only configuration
available to the error locator polynomial generator is the set of BCH code parameters.
The general architecture of the unit follows that presented by Jamro (1997) but
overcomes two shortcomings. First, it expands the basis conversion to support all pentanomials,
not just the single case supported by Jamro. This allows a wider range of BCH block sizes to
be tested. Second, the Jamro decoder requires a code generation step followed by the
compilation of that code. My solution is compile time configurable, requiring no code generation
step. This allowed me to more easily debug timing issues and also shortened the overall
development and experiment cycles.
The only compile time parameters for this unit are the BCH code parameters.
5.1.4 Error Locator Polynomial/Root Solver Interconnect
This interconnect is similar to the syndrome interconnect except that it must serve two
possible pools. The first destination pool consists of traditional Chien root solvers and the
second destination pool consists of reduced root solvers. When the currently selected error
locator polynomial is ready, the interconnect stores the error locator polynomial, the error
count, and the associated channel number.
The interconnect must then determine based on the error count which pool to serve.
It keeps two separate indexing counters, one for each pool. If the error count is 1 then the
reduced root solver pool is used, otherwise the traditional Chien pool is used.
When the appropriate root solver is ready, the interconnect signals it to start and passes
the error locator polynomial along with the associated channel number.
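The pool selection can be summarized with a small behavioral sketch. The class and method names are hypothetical, handshaking and timing are omitted, and the fallback to the Chien pool when no reduced solvers exist is an assumption of this sketch rather than a documented feature of the design.

```python
# Behavioral sketch of the pool selection described above.
# Class and method names are hypothetical; handshaking and timing are omitted.


class RootSolverRouter:
    def __init__(self, chien_units, reduced_units):
        self.pool_sizes = {"chien": chien_units, "reduced": reduced_units}
        self.next_index = {"chien": 0, "reduced": 0}  # one counter per pool

    def route(self, error_count):
        """Return (pool name, unit index) for a block with error_count errors."""
        use_reduced = error_count == 1 and self.pool_sizes["reduced"] > 0
        pool = "reduced" if use_reduced else "chien"
        index = self.next_index[pool]
        # advance this pool's round robin counter only
        self.next_index[pool] = (index + 1) % self.pool_sizes[pool]
        return pool, index


if __name__ == "__main__":
    router = RootSolverRouter(chien_units=1, reduced_units=2)
    for errors in (1, 1, 3, 1, 2):
        print(errors, router.route(errors))
```

Blocks with zero errors never reach this interconnect, since they are flagged to skip root finding at the syndrome stage.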
5.1.5 Traditional Chien Root Solver
The traditional Chien root solver units consist of a set of coefficient registers. Each regis-
ter is wide enough to contain a finite field element from the given BCH configuration. The
number of registers required is equal to the maximum number of errors that the code can
correct. The registers are each multiplied by the appropriate degree of x each cycle and each
cycle all registers are summed together. If the sum is zero, then an error has been located.
This operation is duplicated for bit parallel operation, with the number of bits shared per
register being configurable in order to meet timing. Additionally, the summing operation
provides an opportunity for a configurable amount of pipelining.
The unit contains a start signal that is used to load new values in the coefficient registers,
starting the algorithm. Due to the pipelined nature of the summation operation, an output
signal is provided that indicates that the first bit (or set of bits) of errors is being output on
the current cycle.
The glue logic surrounding the root solver contains a multiplexor that connects to the
busy signal of the output stages. The output stage then counts the number of cycles neces-
sary for the algorithm to complete.
Each unit can be configured with the following compile time parameters:
• Bit width.
• Code block size and number of correctable errors.
• Additional pipeline stages to meet timing.
• Additional register duplication to meet timing.
5.1.6 Reduced Root Solver
The reduced root solver can be used to find the error location for codewords with a single
bit error. It offers large advantages over the traditional Chien search since it only requires a
single register. It is also more efficient in the multi-bit case since, for each bit, the register
is only compared against a constant.
If only one error exists in a codeword, the error locator polynomial is of degree 1 and of
the form:
$Ax + B = 0$ (5.1)
Which can be solved in a single step as:
$x = -B/A$ (5.2)
Because of the algorithm used to find the error locator polynomial, B is always 1.
Additionally, negation is a null operation within binary finite fields. This reduces the equation
further to the form:
$x = 1/A$ (5.3)
Although implementing an inverter would produce the value of x in a single cycle, the
value would be of little use on its own. This is because the value is in the standard basis for
the finite field and not the power form. The power form would give us a direct integer index
to the location of the error. The binary representation of the sequencing of the standard
basis (polynomial form) can be seen in table 1.
Converting from the standard basis to the power form is an algorithmically complex
operation. It is generally on the order of O(N), where N is the number of elements in the
field. Rather than attempt to convert from the standard basis to the power form, I make
two observations.
My first observation is that we need to cycle through each bit in the codeword in order
to output error locations regardless of how the solver functions. My second observation is
summed up by the following re-arrangement:
$Ax = 1$ (5.4)
If we load a register with A and multiply it repeatedly by $x^1$, it will eventually reach the
value of 1. Once it has, we have multiplied A by the correct power of x and found the root.
Because we are only multiplying by $x^1$ per cycle, we can use an LFSR instead of a multiplier.
To start, we load the LFSR with the value of A. Then during each cycle, we advance the
LFSR and compare the value with 1. If they match we have found the location of the root.
Expanding this to support multiple bits scales very well. We advance the LFSR a number
of cycles equal to the number of bits instead of just once. For each output bit, we compare
the value in the LFSR with the next value in the finite field starting with 1 for the first bit.
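The single-bit version of this search can be sketched in software as follows. The toy field ($GF(2^4)$ with $x^4 + x + 1$) and the names are assumptions, and the sketch compares before advancing so that a value of A equal to 1 is found at cycle zero; the hardware ordering of compare and advance may differ.

```python
# Sketch of the reduced root solver: step an LFSR loaded with A until it
# reads 1. GF(2^4) with x^4 + x + 1 is an assumed toy field.

FEEDBACK_TAPS = 0b0011  # x + 1, the low-order terms of x^4 + x + 1
WIDTH = 4
FIELD_ORDER = 15  # number of nonzero elements


def lfsr_step(state):
    """Multiply the state by x using a shift and conditional XOR."""
    carry = (state >> (WIDTH - 1)) & 1
    state = (state << 1) & ((1 << WIDTH) - 1)
    return state ^ FEEDBACK_TAPS if carry else state


def reduced_root_cycle(a):
    """Return k such that A * x^k = 1, or None if A is zero."""
    state = a
    for cycle in range(FIELD_ORDER):
        if state == 1:  # comparison against a constant
            return cycle
        state = lfsr_step(state)
    return None


if __name__ == "__main__":
    a = 0b1000  # x^3 in this field
    print(reduced_root_cycle(a))
```

Because the loop already walks once through the block, the cycle at which the comparison succeeds can be used directly as the output position without any conversion to the power form.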
5.1.7 Output Units
The output units multiplex the data from the root solvers and output it from the decoder.
Each channel has an associated output unit. The output units provide the data indicating
which bits are in error as well as a signal to indicate the start of a new block. Within each
output unit is a counter to keep track of when the output for the given block is complete
and the next block can be processed.
The output units are driven by two flags. One flag indicates that the output unit should
expect data from a root solver, the other flag indicates that the output unit should output
one block’s worth of error free data. Whenever the output unit completes its current block,
it examines these flags to determine what it should output next.
Whenever the flag indicating that data from a root solver should be processed is set, the
associated index of that solver is stored as well. This allows the output unit to assign its
multiplexor to accept data from the appropriate solver.
5.2 Determining the Number of Units
Part of the design is to select the appropriate number of each unit type. The number of
units included in a given design is determined by the expected error rate and the acceptable
miss rate. The miss rate indicates the likelihood that within any given set of blocks, there
would be insufficient hardware to process the data. In this case the affected input channel is
stalled and the decoding of that block is deferred until hardware is available. This causes a
performance drop in decoding that is equal to the chosen miss rate.
The number of units required is decided in two stages. The first stage is the error locator
polynomial generator units. Units are only required for blocks with one or more errors.
Therefore the number of units is chosen based on the probability that more than m blocks
contain one or more errors. We start by using eq. 2.2 to determine the probability that a
single block contains one or more errors. Then we plug this probability into eq. 2.8 and
choose the message size n to be equal to the number of channels. By evaluating this equation
for different values of m, we can find the number of units required to be below the miss
rate probability.
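The sizing procedure can be sketched as a short calculation. The formulas are standard binomial expressions under an independent bit error assumption and stand in for eq. 2.2 and eq. 2.8; the default block size, channel count, and function names are assumptions of this sketch, so results near the miss rate threshold may differ from those of the full model.

```python
# Sketch of sizing the error locator polynomial generator pool, assuming
# independent bit errors. The formulas are standard binomial expressions
# standing in for eq. 2.2 and eq. 2.8; parameter values are assumptions.

from math import comb


def prob_block_has_error(ber, block_bits):
    """Probability that a single block contains one or more bit errors."""
    return 1 - (1 - ber) ** block_bits


def prob_more_than(m, channels, p):
    """Probability that more than m of the channels' blocks are affected."""
    return sum(
        comb(channels, k) * p**k * (1 - p) ** (channels - k)
        for k in range(m + 1, channels + 1)
    )


def generators_needed(ber, block_bits=4096, channels=8, miss_rate=0.02):
    """Smallest unit count whose overflow probability is within the miss rate."""
    p = prob_block_has_error(ber, block_bits)
    for m in range(1, channels + 1):
        if prob_more_than(m, channels, p) <= miss_rate:
            return m
    return channels


if __name__ == "__main__":
    for ber in (5e-6, 2e-5, 5e-5, 1e-4):
        print(ber, generators_needed(ber))
```

The unit count can never exceed the number of channels, since with one unit per channel no block ever has to wait.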
The result of evaluating this equation for the chosen set of BCH parameters and an ac-
ceptable miss rate of 2% is shown in figure 9.
Figure 9. Probability that more than m blocks contain at least one error where n = 8
(log-scale probability plotted against m, 1 to 5, for BERs of $5 \times 10^{-6}$,
$2 \times 10^{-5}$, $5 \times 10^{-5}$, and $1 \times 10^{-4}$, with the 2% miss rate marked)
Figure 9 shows that for a BER of $5 \times 10^{-6}$, only 1 unit is required with a 2% miss rate.
For a BER of $1 \times 10^{-4}$, 5 units are required.
The next step determines the number of traditional Chien search units required. This is
calculated similarly to the above, but we examine the probability that more than one error
occurs since the reduced root solver can only handle one error. We first use eq. 2.8 to find the
probability that more than one error occurs within a single block. We then use eq. 2.8
again, but this time with the message size set to the number of channels and p set to the value
found above.
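The same two-step calculation can be sketched for the traditional Chien pool. As before, the binomial expressions assume independent bit errors and stand in for eq. 2.8, and the default parameters (4096-bit blocks, 8 channels, 2% miss rate) and names are assumptions of this sketch.

```python
# Sketch of sizing the traditional Chien search pool, assuming independent
# bit errors. Standard binomial expressions stand in for eq. 2.8.

from math import comb


def prob_multiple_errors(ber, block_bits):
    """Probability that a block contains more than one bit error."""
    none = (1 - ber) ** block_bits
    one = block_bits * ber * (1 - ber) ** (block_bits - 1)
    return 1 - none - one


def prob_more_than(m, channels, p):
    """Probability that more than m of the channels' blocks are affected."""
    return sum(
        comb(channels, k) * p**k * (1 - p) ** (channels - k)
        for k in range(m + 1, channels + 1)
    )


def chien_units_needed(ber, block_bits=4096, channels=8, miss_rate=0.02):
    p = prob_multiple_errors(ber, block_bits)
    for m in range(1, channels + 1):  # at least one full solver is always kept
        if prob_more_than(m, channels, p) <= miss_rate:
            return m
    return channels


if __name__ == "__main__":
    for ber in (5e-6, 2e-5, 5e-5, 1e-4):
        print(ber, chien_units_needed(ber))
```

With these assumed parameters the sketch yields one unit for a BER of $5 \times 10^{-6}$ and two units for a BER of $1 \times 10^{-4}$.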
The result of evaluating this equation across multiple values of m is shown in figure 10.
Figure 10. Probability that more than m blocks contain more than one error where n = 8
(log-scale probability plotted against m, 1 to 4, for BERs of $5 \times 10^{-6}$,
$2 \times 10^{-5}$, $5 \times 10^{-5}$, and $1 \times 10^{-4}$, with the 2% miss rate marked)
Figure 10 shows that for a BER of $1 \times 10^{-4}$, 2 units are required. For all other examined
error rates only 1 unit is required. The remaining units are filled in with reduced root solvers.
Note that for any decoder at least one traditional Chien solver is required.
A miss rate of 2% is chosen for my experiments as it is a very small performance penalty,
but still large enough for smaller unit counts to be used. In order to demonstrate the
variability of units required for a given miss rate, the BER of $2 \times 10^{-4}$ is examined. This BER
provides a wide range of required units across a set of given miss rates.
Figure 11. Units required for BER of $2 \times 10^{-4}$ (units required, 0 to 8, plotted against
miss rate, 0 to 20, for the polynomial generator and traditional Chien unit types)
As shown in Figure 11, the gain seen for a given miss rate falls off quickly beyond 2%.
Although 2% was chosen for the experiments in this thesis, the additional gains achieved
through much higher miss rates may still be desirable on extremely constrained designs.
Chapter 6
EXPERIMENTS
6.1 Setup
In order to test my ideas and approach, I have implemented them on a Field Pro-
grammable Gate Array (FPGA) in Verilog. A Xilinx Virtex-6 FPGA has been chosen as a
target as it has sufficient logic resources and Input/Outputs (IOs) for implementing the nec-
essary experiments.
The Verilog code is written to be configured through Verilog parameters. This allows
the properties of the decoder to be configured at compile time. The build tools can then
compile and verify a variety of configurations in a batch form without modification to the
codebase.
Validation of the design is performed with a series of testbenches, which verify the
correctness of the compiled code. Each testbench operates by generating a stream of random
input data as well as random bit errors. In order to find problems with the code sooner,
each error count at or below the correction capability of the decoder is selected with equal
probability. The input data is fed to a BCH encoder and then bits are flipped in accordance
with the generated error locations. The modified data is then passed to the BCH decoder and
the error locations it outputs are compared with the true error locations.
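The stimulus side of this procedure can be illustrated with a short Python sketch. The actual testbenches are written in Verilog; the function names below are hypothetical, and the encoder and decoder are replaced by stand-ins (a random bit vector and a diff against the known-good codeword) so that the example is self-contained.

```python
import random


def make_error_pattern(block_bits: int, t: int, rng: random.Random) -> list[int]:
    """Pick an error count uniformly from 0..t, then distinct bit positions."""
    count = rng.randint(0, t)  # every count up to the decoder capability is equally likely
    return sorted(rng.sample(range(block_bits), count))


def inject_errors(codeword: list[int], positions: list[int]) -> list[int]:
    """Return a copy of the codeword with the bits at the given positions flipped."""
    corrupted = list(codeword)
    for pos in positions:
        corrupted[pos] ^= 1
    return corrupted


def check_decoder(reported: list[int], positions: list[int]) -> bool:
    """The decoder passes when its reported error locations match the injected ones."""
    return sorted(reported) == positions


rng = random.Random(1)
codeword = [rng.getrandbits(1) for _ in range(64)]  # stand-in for BCH encoder output
positions = make_error_pattern(len(codeword), 5, rng)
received = inject_errors(codeword, positions)
# Stand-in for the decoder under test: diff against the known-good codeword.
reported = [i for i, (a, b) in enumerate(zip(codeword, received)) if a != b]
assert check_decoder(reported, positions)
```

Drawing the error count uniformly, rather than from the channel's error distribution, exercises the rare multi-error paths as often as the common ones.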
The area of a given design is calculated by implementing the design fully. All inputs and
outputs of the design are assigned registers as would be done in a system design to meet IO
timing. As the configurability of the design leads to a wide range of IO configurations, the
tool is permitted to automatically assign IO locations. The comparative area of a design is
then measured by FPGA slice usage.
Power estimation is performed using the Xilinx XPower Analyzer. Since the static power
consumption of an FPGA does not vary significantly based on logic usage, dynamic power
consumption is compared.
In order to ensure a fair comparison, all designs are constrained to run at a minimum of 200 MHz.
This ensures that complex designs will pay an area penalty as the tool will duplicate registers
to meet timing.
6.2 Baseline Configuration
The baseline configuration is an 8 channel decoder. Not only do many systems contain
a similar number of channels, but it also allows me to fully demonstrate the advantages of
my approach.
Each channel is 4 bits wide. Most flash memory systems operate in an 8 bit wide
configuration, but a 4 bit wide configuration was chosen for two reasons. First, it gives the design
headroom for demonstrating the increase in throughput possible in the optimized
design. Second, many decoders operate at a higher clock rate than the data bus. For a
decoder operating at double the clock rate of an 8 bit data bus, 4 bit wide operation would be
required.
The baseline decoder operates on 4096-bit (512 B) blocks. This is a typical block size
for the error rates examined in this research (Cooke, Berrett, and Schulthies 2006; Cooke
2011). Similar results should be obtainable across a wide range of possible block sizes.
Flash manufacturers typically do not publish BER values for released flash memory. They
instead specify the error correction strength required to reduce the error rate below an
acceptable threshold, typically $1 \times 10^{-15}$ (Cai et al. 2012). Knowing the error correction strength
required, the block size, and the target uncorrectable error rate, we can work backwards
to estimate the associated BER. The values chosen are shown in Table 2.
Table 2. Targeted ECC range
Strength (errors)   Estimated BER        Bits of ECC required
5                   $5 \times 10^{-6}$   65
7                   $2 \times 10^{-5}$   91
8                   $5 \times 10^{-5}$   104
10                  $1 \times 10^{-4}$   130
6.3 Area Optimized BCH Decoder
The area optimized BCH decoder reduces the hardware area while reducing performance
by only 2%. The reduced number of units required is shown in Table 3. Although
8 syndrome calculation blocks are always required, the number of error locators and
traditional Chien search blocks decreases as the BER decreases while still meeting the 2%
miss rate.
The area is then compared with the baseline decoder and the results are shown in
Figure 12. Note that the area includes all hardware components needed to build the optimized
BCH decoder, such as arbitrators, which are not required in the baseline implementation.
Table 3. Hardware units required for area optimized decoder
BER                  Syndrome   Error Locator   Traditional Chien   Reduced Root
$5 \times 10^{-6}$   8          1               1                   0
$2 \times 10^{-5}$   8          3               1                   2
$5 \times 10^{-5}$   8          4               1                   3
$1 \times 10^{-4}$   8          5               2                   3
By optimizing the number of units and utilizing the reduced root solver, my design
reduces the required area by 47%–71% compared to the baseline implementation.
510 6 210 5 510 5 110 4
0
1;000
2;000
3;000
4;000
5;000
BER
Ar
ea
(F
PG
A
sli
ce
s)
Config
Baseline
Optimized
Figure 12. Area saving results
The smaller area of the decoder also translates to dynamic power savings. By profiling
the designs, I can estimate the power consumed by each design. The results are shown in
Figure 13. This equates to a 44%–59% reduction in dynamic power requirements.
510 6 210 5 510 5 110 4
0
0:5
1
1:5
2
BER
D
yn
am
ic
po
we
r(
W
at
ts)
Config
Baseline
Optimized
Figure 13. Power saving results
6.4 Throughput Optimized BCH Decoder
While the proposed area optimized BCH decoder sacrifices a small amount of performance
to reduce the required hardware area, it is also possible to devise a throughput optimized
BCH decoder that improves performance while holding area constant. The optimization is
achieved by increasing the bit parallel configuration parameter until a maximum throughput
is found at the same area cost as the baseline configuration.
[Figure 14 plot: area in FPGA slices versus number of parallel bits, with the baseline (4-bit) area marked]
Figure 14. Requirements of $2 \times 10^{-5}$ design
Figure 14 shows the process as applied to the $2 \times 10^{-5}$ BER configuration. The area
consumed by the baseline unoptimized design is shown by the red line. The discontinuity in
results is due to an additional level of hardware duplication in the Chien search when
moving to a 20 bit wide unit to meet timing.
While the unoptimized design consumes an area of 3168 slices, the optimized design
consumes an area of only 1272 slices. Both designs accept 4 bits per cycle in the syndrome
calculation stage, and output 4 bits per cycle in their output stage. I then implement the optimized
design at 8 bits, 16 bits, 18 bits, and 20 bits per cycle. These designs increase throughput by
operating on more input and output bits per clock cycle. We can see that the optimized design
operating at 18 bits per cycle only consumes 2874 slices, which is less than the unoptimized
design operating at only 4 bits per cycle.
Thus it is possible to implement an 18-bit design within the same area, leading to a 4.5x
improvement in performance. Note that there is a 2% performance degradation due to the
miss rate, which is negligible compared with the 4.5x performance gain. Similar
improvements in performance are possible with the other configurations and are shown in
Figure 15. The amount of performance improvement is related to the area savings provided by
the optimized decoder.
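The figures quoted for the $2 \times 10^{-5}$ configuration can be checked with a few lines of arithmetic. The sketch below uses only the slice counts and bit widths given in the text. Treating the miss rate as a proportional throughput loss is an assumption made here for illustration, and the variable names are hypothetical.

```python
baseline_slices, baseline_bits = 3168, 4           # unoptimized design, 4 bits per cycle
optimized_slices_4bit = 1272                       # optimized design, 4 bits per cycle
optimized_slices_18bit, optimized_bits = 2874, 18  # optimized design, 18 bits per cycle
miss_rate = 0.02                                   # targeted miss rate

area_reduction = 1 - optimized_slices_4bit / baseline_slices
speedup = optimized_bits / baseline_bits
# Assumption: a miss costs throughput in proportion to the miss rate.
effective_speedup = speedup * (1 - miss_rate)

# The 18-bit optimized design must fit within the baseline area.
assert optimized_slices_18bit <= baseline_slices
print(f"area reduction at 4 bits: {area_reduction:.1%}")
print(f"raw speedup: {speedup:.2f}x, with misses: {effective_speedup:.2f}x")
```

Under that assumption, the miss penalty changes the speedup only in the second decimal place, which supports treating it as negligible.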
510 6 210 5 510 5 110 4
0
5
10
15
20
25
BER
Bi
t-p
ar
all
el
op
er
at
io
n
Config
Baseline
Optimized
Figure 15. Throughput optimization results
6.5 Flash Lifetime Optimized Design
Similarly, the area reduction can be utilized to increase the flash lifetime by providing higher
error correction strength. To provide stronger error correction, a larger hardware area is
required. The area reduction in my approach is utilized to provide greater error correction
strength in a smaller area. For an 8 channel unit and a 2% targeted miss rate, the hardware
area requirement in my approach for a BER of $1 \times 10^{-4}$ becomes similar to the area in the
baseline approach for a BER of $5 \times 10^{-6}$. Table 4 shows the units required in my approach
for different given BERs.
Table 4. Hardware units required for lifetime optimized design
BER                    Syndrome   Error Locator   Traditional Chien   Reduced Root
$1.2 \times 10^{-4}$   8          6               3                   3
$1.5 \times 10^{-4}$   8          7               3                   4
$2.0 \times 10^{-4}$   8          7               4                   3
Table 5 compares the error correction capability between the baseline approach and the
proposed optimization with a given hardware area constraint. For instance, for the hardware
area with which the baseline approach can handle a BER of $5 \times 10^{-6}$, the proposed approach
can handle a BER of $1 \times 10^{-4}$. Note that in the table, my approach requires no larger
hardware area than the baseline approach. In addition to increased error correction capability,
the implementation includes additional hardware units to meet the 2% miss rate.
Therefore, my approach can correct more errors than the baseline approach without sacrificing
performance, hardware area, or power consumption.
Table 5. BER achievable with lifetime optimized design
Original BER         Original t   Optimized BER          Optimized t
$5 \times 10^{-6}$   5            $1.0 \times 10^{-4}$   10
$2 \times 10^{-5}$   7            $1.2 \times 10^{-4}$   11
$5 \times 10^{-5}$   8            $1.5 \times 10^{-4}$   12
$1 \times 10^{-4}$   10           $2.0 \times 10^{-4}$   13
Equation 2.11 shows the relation between BER and ageing. Since the proposed scheme
can correct more errors, allowing a decoder targeted for a higher BER, the lifetime of the same
NAND flash memory is prolonged compared with the baseline implementation. Figure 16
shows the lifetime improvement over the baseline BCH decoder. As the BER decreases, more
hardware reduction is achievable and more errors can be corrected by utilizing the reduced
area. The flash lifetime is extended by 1.4x–4.5x.
510 6 210 5 510 5 110 4
0
1
2
3
4
5
BER
O
pt
im
ize
dl
ife
tim
er
at
io
Figure 16. Improved lifetime
Chapter 7
CONCLUSION AND FUTURE WORK
My research goal was to improve the efficiency of ECC systems by concentrating on
multi-channel BCH architectures. In this thesis I have presented a novel multi-channel BCH
decoder optimization that reduces the hardware area requirement by considering a common
error case. The proposed scheme utilizes a pooled group of shared decoding blocks. Compared
with a traditional multi-channel implementation, it reduces the hardware area by 47%–71%.
The area reduction also reduces dynamic power consumption by 44%–59%. If the reduced
hardware area is instead utilized to increase performance, the throughput is improved by
3x–5x; if it is utilized to correct more errors, the lifetime of NAND flash increases by
1.4x–4.5x.
The approach does increase the complexity of the decoder by adding arbitrators between
pipeline stages, and incurs a small performance penalty if a miss occurs. However, the mas-
sive area savings provided are an excellent trade-off. Additionally, I’ve shown how the area
savings instead can be used to increase overall performance.
Although I have already achieved significant gains, additional work could lead to further
improvements across a wider range of bit error rates. The most straightforward extension of
my work is to create additional types of root solver units. Additional root solver units could
be of two types, both direct algebraic solutions and modified Chien units.
Although complex, a direct algebraic method is available for solving second degree finite
field polynomials. This can be used to create a reduced root solver for handling blocks with
two errors, which would be useful at higher error rates. Additional methods are available for
up to degree 10, but grow quickly in complexity (Zinoviev 1996).
It is also possible to modify Chien units to reduce their error correcting capacity. Units
could be sized to correct up to a maximum number of errors. These reduced Chien units
would only contain a subset of registers and multipliers and would scale linearly with their
error correction capability. They would not offer the large savings seen by the reduced root
solver, but would be fully configurable leading to applicability across a wider BER range.
A second improvement may be a speculative error polynomial generator. The generator
could be sized such that it only supports up to a certain degree of polynomial. Similarly to the
reduced Chien unit, the size of the unit would scale with the maximum number of errors that
could be corrected. Because the generation of the error locator polynomials calculates the
number of errors in the block, the step would need to be performed speculatively. If during
calculation the number of errors exceeded the capacity of the current generator, calculation
would need to be restarted by a full unit.
These improvements would extend this work to optimize higher bit error rates. Al-
though the improvements are untested, they warrant further study.
REFERENCES
Abraham, Michael, et al. 2010. “NAND flash trends for SSD/Enterprise.” Flash Memory
Summit.
Berrou, Claude, and Alain Glavieux. 1996. “Near optimum error correcting coding and de-
coding: Turbo-codes.” Communications, IEEE Transactions on 44 (10): 1261–1271.
Bose, Raj Chandra, and Dwijendra K. Ray-Chaudhuri. 1960. “On a class of error correcting
binary group codes.” Information and control 3 (1): 68–79.
Cai, Yu, Erich F. Haratsch, Onur Mutlu, and Ken Mai. 2012. “Error patterns in MLC NAND
flash memory: Measurement, characterization, and analysis.” In Design, Automation
& Test in Europe Conference & Exhibition (DATE), 2012, 521–526.
Chen, Yanni, and Keshab K. Parhi. 2004. “Small area parallel Chien search architectures for
long BCH codes.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems
12 (5): 545–549.
Cooke, Jim. 2011. “NAND 201: An Update on the Continued Evolution of NAND Flash.”
Cooke, Jim, B. Berrett, and V. Schulthies. 2006. “NAND 101: An Introduction to NAND
Flash and How to Design It in to Your Next Product.” Micron: 1–28.
Epstein, Marvin A. 1958. “Algebraic decoding for a binary erasure channel.”
Forney Jr, G. David. 1973. “The Viterbi algorithm.” Proceedings of the IEEE 61 (3): 268–278.
Gallager, Robert G. 1962. “Low-density parity-check codes.” Information Theory, IRE
Transactions on 8 (1): 21–28.
Garello, Roberto, Franco Chiaraluce, Paola Pierleoni, Marco Scaloni, and Sergio Benedetto.
2001. “On error floor and free distance of turbo codes.” In Communications, 2001. ICC
2001. IEEE International Conference on, 1:45–49.
Hagenauer, Joachim, and Peter Hoeher. 1989. “A Viterbi algorithm with soft-decision
outputs and its applications.” In Global Telecommunications Conference and Exhibition
'Communications Technology for the 1990s and Beyond' (GLOBECOM), 1989.
IEEE, 1680–1686.
Han, Yunghsiang S., Carlos RP Hartmann, and Chih-Chieh Chen. 1993. “Efficient priority-
first search maximum-likelihood soft-decision decoding of linear block codes.” Infor-
mation Theory, IEEE Transactions on 39 (5): 1514–1523.
Hong, Jonathan, and Martin Vetterli. 1995. “Simple algorithms for BCH decoding.” Com-
munications, IEEE Transactions on 43 (8): 2324–2333.
Houghton, A. 2001. Error coding for engineers. Springer Science & Business Media.
Jamro, Ernest. 1997. “The design of a VHDL based synthesis tool for BCH codecs.” The
University of Huddersfield.
Jun, Zhang, Wang Zhi-Gong, Hu Qing-Sheng, and Xiao Jie. 2005. “Optimized design for
high-speed parallel BCH encoder.” In VLSI Design and Video Technology, 2005.
Proceedings of 2005 IEEE International Workshop on, 97–100.
Justesen, Jørn, Tom Høholdt, and Christian Thommesen. 2004. “Decoding of concatenated
codes with interleaved outer codes.” In 2004 IEEE International Symposium on
Information Theory.
Kristian, Hans, Hernando Wahyono, Kiki Rizki, and Trio Adiono. 2010. “Ultra-fast-scalable
BCH decoder with efficient-Extended Fast Chien Search.” In Computer Science and
Information Technology (ICCSIT), 2010 3rd IEEE International Conference on, 4:338–343.
Lee, Je-Hoon, Sharad Shakya, Deepti Gupta, Ajay K. Sharma, Qin-li An, Jian-feng Chen,
and Zhong-hai Yin. 2013. “Implementation of Parallel BCH Encoder Employing Tree-
Type Systolic Array Architecture.”
Lee, Kihoon, Han-Gil Kang, Jeong-In Park, and Hanho Lee. 2010. “100GB/S two-iteration
concatenated BCH decoder architecture for optical communications.” In Signal Pro-
cessing Systems (SIPS), 2010 IEEE Workshop on, 404–409.
Lee, Youngjoo, Hoyoung Yoo, and In-Cheol Park. 2012. “Small-area parallel syndrome
calculation for strong BCH decoding.” In Acoustics, Speech and Signal Processing
(ICASSP), 2012 IEEE International Conference on, 1609–1612.
Lin, Shu, and Daniel J. Costello. 1983. Error control coding: fundamentals and applications.
Pearson-Prentice Hall Upper Saddle River.
Litwin, Louis. 2001. “Error control coding in digital communications systems.” RF Design,
July.
Luyi, Sui, Fu Jinyi, and Yang Xiaohua. 2012. “Forward error correction.” In Computational
and Information Sciences (ICCIS), 2012 Fourth International Conference on, 37–40.
Marvell. 2014. Marvell’s Groundbreaking LDPC Technology Enables TLC NAND Flash
With New Generation SATA SSD Controller.
McGregor, Andrew, and Olgica Milenkovic. 2010. “On the hardness of approximating
stopping and trapping sets.” Information Theory, IEEE Transactions on 56 (4): 1640–1650.
Morelos-Zaragoza, Robert H. 2006. The art of error correcting coding. John Wiley & Sons.
Puttarak, Nattakan. 2011. “Coding for storage: disk arrays, flash memory, and distributed
storage networks.”
Rate, Switch. 1983. “Forward error correction schemes for digital communications.”
Reed, Irving S., and Gustave Solomon. 1960. “Polynomial codes over certain finite fields.”
Journal of the Society for Industrial & Applied Mathematics 8 (2): 300–304.
Richardson, Thomas J., Mohammad Amin Shokrollahi, and Rüdiger L. Urbanke. 2001.
“Design of capacity-approaching irregular low-density parity-check codes.” Information
Theory, IEEE Transactions on 47 (2): 619–637.
Saluja, Kevval K. 1987. “Linear Feedback Shift Registers Theory and Applications.”
Department of Electrical and Computer Engineering, University of Wisconsin-Madison: 4–14.
Shannon, C. E. 1948. “A Mathematical Theory of Communication.” Bell System Technical
Journal, number July: 623.
Shi, Yun Q., Xi Min Zhang, Zhi-Cheng Ni, and Nirwan Ansari. 2004. “Interleaving for
combating bursts of errors.” Circuits and Systems Magazine, IEEE 4 (1): 29–42.
Strukov, Dmitri. 2006. “The area and latency tradeoffs of binary bit-parallel BCH decoders
for prospective nanoelectronic memories.” In Signals, Systems and Computers, 2006.
ACSSC’06. Fortieth Asilomar Conference on, 1183–1187.
Sun, Fei, Ken Rose, and Tong Zhang. 2006. “On the use of strong BCH codes for improv-
ing multilevel NAND flash memory storage capacity.” In IEEE Workshop on Signal
Processing Systems (SiPS): Design and Implementation.
Zambelli, Cristian, Marco Indaco, Michele Fabiano, Stefano Di Carlo, Paolo Prinetto,
Piero Olivo, and Davide Bertozzi. 2012. “A cross-layer approach for new reliability-
performance trade-offs in MLC NAND flash memories.” In Proceedings of the Con-
ference on Design, Automation and Test in Europe, 881–886.
Zinoviev, Victor. 1996. On the solution of equations of degree ≤ 10 over finite fields
GF(2^m). Technical report, Rapport de recherche/INRIA RR-2829.
http://opac.inria.fr/record=b1036369.
