With ever-growing storage density and high-speed, low-cost data access, flash memory has become increasingly popular. Multi-level cell (MLC) NAND flash memory, which balances data density against memory stability, holds the largest market share of flash memory. With aggressive memory scaling, however, reliability decays sharply owing to multiple interferences. Therefore, the control system should embed a suitable error correction code (ECC) to guarantee data integrity and accuracy. We propose the pre-check scheme, a multi-strategy polar code scheme that strikes a balance between frame error rate (FER) and decoding latency. Three decoders, namely the binary-input, quantized-soft, and pure-soft decoders, are embedded in this scheme. Since the calculation of soft log-likelihood ratio (LLR) inputs needs multiple sensing operations and optional quantization boundaries, a 2-bit quantized hard-decision decoder is proposed that outperforms the hard-decision LDPC bit-flipping decoder with fewer sensing operations. We also note that polar codes have much lower computational complexity than LDPC codes. The stepwise maximum mutual information (SMMI) scheme is proposed to obtain overlapped boundaries without exhaustive search. The mapping scheme using Gray code is employed and proved to achieve better raw error performance than the alternatives. Hardware architectures are also given in this paper.
Introduction
Nowadays, ever-developing digital technologies enable extremely high communication speeds. However, the traditional hard disk drive (HDD) can no longer meet the throughput and latency requirements of most state-of-the-art application scenarios. To this end, NAND flash memory, which offers lower access time, higher compactness, and less noise, has become increasingly popular in the storage market [1, 2].
The past decade has witnessed a steady fall in the price of flash memory, and further price drops are expected in the future [3, 4]. This trend has enabled the solid state drive (SSD), which is mainly based on NAND flash memory, to occupy a large share of both business and consumer markets.
Contributions
To balance performance and delay, this paper proposes a pre-check scheme based on polar codes for MLC NAND flash. Our main contributions are as follows.
• We propose the pre-check scheme to arrange pure-soft, quantized-soft, and binary-input polar decoders in different life-stages of SSD.
• We have proved that polar code is a balanced code for which each codeword contains an equal number of zero and one bits.
• We propose a well-designed hard-decision binary-input polar decoder. This decoder directly employs the 1-bit hard results returned from the voltage detector and utilizes a single XOR gate to calculate log-likelihood ratios (LLRs).
• We compare the complexities of the binary-input successive cancellation (SC) polar decoder, the SC polar decoder, the binary-input bit-flipping LDPC decoder, and the layered BP (LBP) LDPC decoder. Results show that the binary-input SC polar decoder has the lowest complexity for a given target error performance. Besides, it also performs better than the traditional hard-decision bit-flipping LDPC decoder.
• We propose a new quantized-soft polar decoder with refined boundary-defining scheme to improve the empirical method.
• We clarify that Gray code is the optimal scheme to map 2 bits in 1 cell.
Notations
Let L and L designate likelihood ratio (LR) and LLR, respectively. Sets are denoted by uppercase calligraphic letters as A. We indicate the probability density function (PDF) of a voltage distribution i by p (i) . The uppercase letter P designates probability cumulated by PDFs. The entropy function is H.
Paper outline
The remainder of this paper is organized as follows. Section 2 reviews the background of NAND flash and polar codes. Section 3 proposes the Gray mapping scheme and the pre-check scheme; the three polar decoders are also discussed in this section. In Section 4, the hardware architecture of the proposed binary-input decoder is detailed. In Section 5, performance and complexity are compared for different decoders. Finally, Section 6 concludes this paper. The proof for the Gray mapping scheme and the correction of previous work [10] are presented in the Appendix.
2 Background of MLC NAND flash memory and polar codes
Modeling of NAND flash memory
Floating gate transistors constitute the NAND flash memory [1]. Programming is an operation that injects charge stepwise until a target voltage is reached. Unavoidably influenced by multiple interferences, the cell voltages spread over wide ranges, which results in overlapped regions. The voltage distribution adopted in this work originates from [25]. The Gaussian distribution is selected for both convenience and accuracy of modeling [26].
For design purposes, each cell is initialized with 4 distributions that are well separated from each other. However, these distributions get closer with increasing program/erase (P/E) cycles and multiple interferences. Raw errors occur when overlapped regions emerge.
Basics of polar codes
Proposed by Arıkan in [17], polar codes are capable of achieving the symmetric capacity $I(W)$ of any given B-DMC $W$ as the code length $N$ goes to infinity. To better understand polar codes, the LLR-based min-sum SC decoding algorithm [27] is introduced below.
In an arbitrary code with parameters $(N, K, \mathcal{A}, u_{\mathcal{A}^c})$, the code length and information length are represented by $N$ and $K$. The source vector, i.e., the input vector of the encoder, is denoted by $u_1^N$ and consists of an information part $u_{\mathcal{A}}$ and a frozen part $u_{\mathcal{A}^c}$. Note that the frozen bits $u_{\mathcal{A}^c}$ are usually set to 0.
The LLR-based min-sum SC decoding algorithm is defined as
The symbol L in (1) and (2) denotes LLR
where L (i)
The hardware architecture of this algorithm is explained in [19] .
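To make the recursion concrete, the following is a minimal sketch of min-sum SC decoding in its commonly used recursive form. It assumes the standard update rules (a sign/minimum rule for Type II PEs and a signed sum for Type I PEs), a natural-order generator matrix without bit reversal, and the convention that a positive LLR favors bit 0; these conventions and all function names are illustrative assumptions rather than the exact formulation of (1) and (2).

```python
def polar_encode(u):
    """Compute x = u * F^(kron n) over GF(2) in natural order (no bit reversal)."""
    x = list(u)
    n, step = len(x), 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


def sign(v):
    return (v > 0) - (v < 0)


def f_minsum(a, b):
    """Type II PE: sign product times the smaller magnitude."""
    return sign(a) * sign(b) * min(abs(a), abs(b))


def g_sum(a, b, u):
    """Type I PE: signed sum steered by the previously decided bit u."""
    return (1 - 2 * u) * a + b


def sc_decode(llr, frozen):
    """Return (u_hat, x_hat) for channel LLRs and a per-position frozen mask."""
    n = len(llr)
    if n == 1:
        # Frozen bits are fixed to 0; otherwise decide on the sign of the LLR.
        bit = 0 if frozen[0] else int(llr[0] < 0)
        return [bit], [bit]
    half = n // 2
    upper, lower = llr[:half], llr[half:]
    left_llr = [f_minsum(a, b) for a, b in zip(upper, lower)]
    u_left, x_left = sc_decode(left_llr, frozen[:half])
    right_llr = [g_sum(a, b, u) for a, b, u in zip(upper, lower, x_left)]
    u_right, x_right = sc_decode(right_llr, frozen[half:])
    x_hat = [l ^ r for l, r in zip(x_left, x_right)] + x_right
    return u_left + u_right, x_hat
```

In a noiseless setting, feeding `sc_decode` with LLRs of constant magnitude whose signs follow the codeword returns the original source vector, which is a quick way to check that the encoder and decoder use the same bit ordering.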
Multi-strategy ECC scheme
In this section, we first demonstrate the adopted Gray mapping scheme. Then we propose the pre-check scheme with multi-strategy ECC and 3 corresponding polar decoders.
Gray mapping and detection
The programmed symbols for each state of MLC NAND flash memory are shown in Figure 1. Note that a raw error happens when a state is mistaken for one of its neighboring states. Moreover, $S_1$ and $S_2$ differ in 2 bits under direct mapping. Therefore, we should consider a mapping scheme that is capable of reducing raw error bits, which is shown in Figure 2. To this end, Gray mapping, which has the minimum difference between adjacent states, is the optimal choice. The proof is given in Appendix A.
Figure 3 Error correction module in SSD controller.
Figure 4 Flow chart of pre-check scheme.
Control system
The overall architecture of the error correction module is illustrated in Figure 3. The polar encoder encodes the external bit stream into binary codewords. These codewords are then mapped pairwise to a certain voltage in each cell. To recover the stored data, the detector first senses a cell several times and compares the stored voltage to reference voltages. After that, the pre-check scheme determines which decoder should be picked and then converts the comparison results into LLRs to feed the corresponding decoder. Figure 4 illustrates the flow of each step in the pre-check scheme. The cell state is checked at the beginning to determine which decoder should be picked. When cell distortion is slight, the binary-input decoder is chosen owing to its low decoding latency. When distortion gets worse, soft-decision decoders should be selected to guarantee data integrity.
Pre-check scheme
This scheme aims to select the most suitable decoder according to the cell condition so as to meet the demand for storage reliability.
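Assume the mean values of the four states are $V$, $2V$, $3V$, and $4V$, respectively, and the standard deviation is $\sigma$, which is identical for all distributions.
The cell state can be expressed as the equation set
$$p^{(i)}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(x - (i+1)V\right)^2}{2\sigma^2}\right), \quad i = 0, 1, 2, 3,$$
from which we can obtain the intersections $[R_1, R_2, R_3]$ between the 4 distributions, which are
$$R_i = \left(i + \tfrac{1}{2}\right)V, \quad i = 1, 2, 3.$$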
Since the mean values are uniformly spaced and the standard deviations are identical, the reference voltage $R_i$ is the midpoint between $\mu^{(i-1)}$ and $\mu^{(i)}$. A raw error occurs when the sensed voltage crosses a reference voltage. For example, if a voltage of state $p^{(0)}$ is greater than $R_1$, it is more likely to be considered as a voltage in $p^{(1)}$ (i.e., an error happens). Therefore, we can calculate the raw error probability for each overlapped region by
$$P_E = \begin{cases} Q\left(\dfrac{V}{2\sigma}\right), & \text{leftmost and rightmost distributions,} \\ 2Q\left(\dfrac{V}{2\sigma}\right), & \text{middle distributions,} \end{cases}$$
where $V$ is the distance between two adjacent distributions and $\sigma$ is the standard deviation. In NAND flash memory, the values of $V$ and $\sigma$ change over time due to voltage shifting and cell distortion. Since $P_E$ is monotonically decreasing in $V/\sigma$ and $V/\sigma$ decreases over time (the experiment in [26] has shown that the signal-to-noise ratio (SNR) in the NAND flash memory degrades by about 0.13 dB per 1k P/E cycles), the value of $P_E$ increases.
With a numerical value of $P_E$, we can set several thresholds to adjust the decoding scheme to satisfy the performance requirements of the system.
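The threshold idea can be sketched as follows. The raw error probability is evaluated with the Q-function for equally spaced means and a common standard deviation, as assumed in this subsection. The two threshold values and the function names are hypothetical placeholders chosen only for illustration; the paper does not specify them.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def raw_error_probability(v, sigma, middle=False):
    """Raw error probability for spacing v and standard deviation sigma.

    The reference voltage sits halfway between two means, so one tail is
    Q(v / (2 * sigma)); a middle distribution has a neighbor on both sides.
    """
    tail = q_function(v / (2.0 * sigma))
    return 2.0 * tail if middle else tail


def select_decoder(p_e, soft_threshold=1e-3, pure_soft_threshold=1e-2):
    """Pick a decoder from P_E; both thresholds are hypothetical values."""
    if p_e < soft_threshold:
        return "binary-input"
    if p_e < pure_soft_threshold:
        return "quantized-soft"
    return "pure-soft"


# Example: as sigma grows with wear, the selected decoder becomes stronger.
for sigma in (0.10, 0.20, 0.30):
    p_e = raw_error_probability(1.0, sigma, middle=True)
    print(sigma, p_e, select_decoder(p_e))
```

In a real controller the thresholds would be tuned against the FER targets of each decoder.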
Pure-soft decoder
The sensed voltage needs to be converted into digital LLR to feed the pure-soft decoder.
Given the model of NAND flash memory in Section 2, the whole voltage range can be described with 4 Gaussian distributions indicated by $p^{(0)}(x)$, $p^{(1)}(x)$, $p^{(2)}(x)$, and $p^{(3)}(x)$. To obtain the definition of LLR in NAND flash memory, some basic ideas need to be clarified first.
Lemma 1. Polar code is a balanced code for which each codeword contains an equal number of zero and one bits.
Proof. The codeword $x_1^N$ and its $i$-th element $x_N^{(i)}$ are constructed as
$$x_1^N = u_1^N G_N, \qquad x_N^{(i)} = u_1^N G_N^{(i)},$$
where $u_1^N$ is the source information, $G_N$ is the generator matrix, and $G_N^{(i)}$ denotes the $i$-th column of $G_N$. By the property of multiplication in GF(2), whether $x_N^{(i)}$ is 0 or 1 is determined only by the number of 1's in $u_1^N$ whose corresponding positions in $G_N^{(i)}$ are 1.
Since some elements in $G_N$ are 0, only a part of the elements in $u_1^N$ participate in the calculation. In the example of (8), only $u_3$ and $u_4$ are concerned.
Assume that the number of 1's in G 
Song H C, et al. Sci China Inf Sci
October 2018 Vol. 61 102307:6
Lemma 1 is the foundation for LLR calculation in NAND flash memory. This a priori property justifies the use of Bayes' law in the LLR calculation for all the polar decoders discussed in this paper.
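The a priori property that the LLR derivation relies on, namely that a stored bit is 0 or 1 with equal probability, can be checked by enumeration on a small code. The sketch below assumes a natural-order generator matrix $G_N = F^{\otimes n}$ and a hypothetical $N = 8$ code whose information positions (0-indexed) are 3, 5, 6, and 7. It counts, over all equally likely source vectors, how often each codeword position equals 1. Note that this is a statement about positions averaged over source vectors; an individual codeword such as the all-zero one is not itself balanced.

```python
from itertools import product


def polar_encode(u):
    """Compute x = u * F^(kron n) over GF(2) in natural order (no bit reversal)."""
    x = list(u)
    n, step = len(x), 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


def ones_per_position(n, info_positions):
    """Count the 1's at every codeword position over all source vectors."""
    counts = [0] * n
    for bits in product((0, 1), repeat=len(info_positions)):
        u = [0] * n  # frozen bits stay at 0
        for pos, bit in zip(info_positions, bits):
            u[pos] = bit
        for i, x_i in enumerate(polar_encode(u)):
            counts[i] += x_i
    return counts


# Hypothetical (8, 4) code: 16 source vectors in total.
print(ones_per_position(8, [3, 5, 6, 7]))
```

Every position is balanced here because the last row of $F^{\otimes n}$ is all ones, so the last source bit, which is an information bit in this example, contributes to every codeword position.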
Lemma 2. For any stored bit b i , its LLR is defined as
where
Proof. According to the definition, the LLR of b i should be denoted by
However, considering the difficulty of directly acquiring the a posteriori probability $p(b_i \mid V_d)$, it is simpler to transform (10) into the form of a likelihood function according to Bayes' law as
where $p(b_i = 1)/p(b_i = 0) = 1$ according to Lemma 1, and $p(V_d \mid b_i)$ is the summation of the PDFs for which $b_i$ takes the given value. Therefore, the LLR of $b_i$ is exactly the form in (9). Note that $O_i$ and $Z_i$ change according to the adopted mapping scheme. An example is shown in Figure 5.
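As an illustration of Lemma 2, the sketch below evaluates the soft LLR of the MSB and the LSB from a sensed voltage. It assumes the Gaussian parameters of the simulation settings in Section 5 (means 0, 3.25, 4.55, and 6.5 V, standard deviations 2σ, σ, σ, and 1.4σ) and the Gray order 00, 10, 11, 01. It places the states where the bit equals 1 in the numerator, which is how the numerator of the quantized LLR is described in Subsection 3.5. The names `ONES` and `soft_llr` are illustrative.

```python
import math

MEANS = (0.0, 3.25, 4.55, 6.5)    # state means in volts
SIGMA_SCALE = (2.0, 1.0, 1.0, 1.4)  # per-state multiples of sigma
# Gray order 00, 10, 11, 01: states in which each bit equals 1.
ONES = {"MSB": (1, 2), "LSB": (2, 3)}


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_llr(v_d, sigma, bit):
    """LLR of the given bit ('MSB' or 'LSB') for a sensed voltage v_d.

    The numerator sums the PDFs of the states where the bit is 1 and the
    denominator sums those where it is 0 (equal priors assumed).
    """
    pdfs = [gaussian_pdf(v_d, m, s * sigma) for m, s in zip(MEANS, SIGMA_SCALE)]
    ones = sum(pdfs[k] for k in ONES[bit])
    zeros = sum(pdfs[k] for k in range(4) if k not in ONES[bit])
    return math.log(ones / zeros)


# A voltage near the 11 state gives positive LLRs for both bits.
print(soft_llr(4.55, 0.3, "MSB"), soft_llr(4.55, 0.3, "LSB"))
```

For voltages very far from all states the PDFs can underflow to zero in floating point; a production implementation would work in the log domain.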
Quantized-soft decoder
The LLR calculation mentioned in Subsection 3.4 can achieve the best performance of error correction. However, it requires an accurate value of the sensed voltage, which is unrealistic in circuits. Therefore, a proper scheme which can balance the numerical accuracy and sensing latency is highly needed.
Problems in quantized-soft decoder
The main constraint is that the detector can only return a comparison result between the sensed voltage and pre-set references, which we call a "hard result" and which contains only 1 bit of information.
This raises two problems. The first is obtaining proper references (or boundaries). The definition of overlapped regions is crucial for calculating LLRs.
The other problem is the number of sensing operations. Since an LLR contains more than 1 bit of information, multiple sensing operations are needed to convert hard results into an LLR. An example is shown in Figure 6 [10].
Boundaries defined by constant ratio
In our previous work [28] , we adopted the boundary-defining scheme that was proposed in [9] and expanded in [10] . In this subsection, we show the basic idea in [9] and the re-derived quadratic equation set which differs from the equations in [10] . B 
where $p^{(k)}(x)$ is the $k$-th voltage distribution. Under Gaussian estimation, this calculation is significantly simplified compared with [9]. Let $\sigma_k^2$ and $\mu_k$ be the variance and mean value of $p^{(k)}$, then we have
Eq. (13) is derived from (12). The work of [10] does not show the derivation, and its result is slightly wrong. We mark the corrections in red and give the further derivation in Appendix B.
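To show how a constant-ratio condition leads to a quadratic equation under the Gaussian model, the sketch below assumes that a boundary $B$ between two adjacent distributions satisfies $p^{(k)}(B)/p^{(k+1)}(B) = R$. Taking logarithms gives $aB^2 + bB + c = 0$ with the coefficients computed in the code. This is an independent derivation under that assumed condition, not a transcription of (12) and (13), and the function names are illustrative.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def constant_ratio_boundary(mu_k, sigma_k, mu_next, sigma_next, ratio):
    """Solve p_k(B) / p_next(B) = ratio for B between the two means."""
    a = 1.0 / (2.0 * sigma_next ** 2) - 1.0 / (2.0 * sigma_k ** 2)
    b = mu_k / sigma_k ** 2 - mu_next / sigma_next ** 2
    c = (mu_next ** 2 / (2.0 * sigma_next ** 2)
         - mu_k ** 2 / (2.0 * sigma_k ** 2)
         + math.log(sigma_next / sigma_k)
         - math.log(ratio))
    if abs(a) < 1e-12:
        # Equal variances: the quadratic term vanishes and the equation is linear.
        return -c / b
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("no real boundary for this ratio")
    roots = [(-b + s * math.sqrt(disc)) / (2.0 * a) for s in (1.0, -1.0)]
    inside = [r for r in roots if mu_k <= r <= mu_next]
    if not inside:
        raise ValueError("no boundary between the two means")
    return inside[0]  # first root that lies between the means


# With equal variances and ratio 1 the boundary is the midpoint of the means.
print(constant_ratio_boundary(1.0, 0.2, 2.0, 0.2, 1.0))
```

Substituting the returned boundary back into the two PDFs and checking that their ratio equals $R$ is a simple sanity test for any re-derived coefficient set.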
Boundaries defined by stepwise mutual information
The boundary-defining scheme of constant ratio mentioned in Subsection 3.5.2 is effective to locate the overlapped regions. However, there still remains an unsolved problem that the value of R is mostly determined by empirical evidence.
A different scheme called maximum mutual information (MMI) is proposed in [29] which aims to set quantization boundaries that maximize the mutual information. MMI quantizes the whole voltage range into (M + 1) regions with M sensing operations.
However, MMI is a general case instead of an optimal choice for boundary selection because the mutual information defined in [29] is calculated for each region instead of original bits, whereas LLRs are calculated bitwise. In this work, we calculate mutual information for the most significant bit (MSB) and the least significant bit (LSB) separately, which we call stepwise mutual information (SMMI). Figure 7 shows the relationship between reference voltages and mapped bits. It is obvious that the judgement of the LSB only relates to 2 quantization boundaries q 3 and q 4 . Similarly, q 1 , q 2 , q 5 , and q 6 are responsible for sensing operation of the MSB. We take the LSB as an example to demonstrate the channel and the entropy calculation under SMMI strategy.
In Figure 8, the whole range is separated into 3 quantized regions; hence this quantization model is equivalent to a 2-input, 3-output channel model with $X \in \{0, 1\}$ and $Y \in \{0, e, 1\}$, given in Figure 9, which is similar to the model of single-level cell (SLC) NAND flash memory with 2 reads in [30]. According to Lemma 1, $X$ sends 0 and 1 with equal probability. Therefore, the mutual information $I$ between $X$ and $Y$ is calculated as
For a settled voltage distribution, the mutual information between X and Y can be numerically maximized to obtain desired boundaries q 3 and q 4 that yield the SMMI.
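A minimal numerical sketch of this step for the LSB is given below. It assumes the Gaussian parameters of the simulation settings in Section 5 and the Gray order 00, 10, 11, 01, so that $X = 0$ corresponds to the two lower states and $X = 1$ to the two upper states, each with equal prior. A coarse grid search stands in for the numerical maximization; the grid range, the step count, and all function names are illustrative assumptions.

```python
import math

MEANS = (0.0, 3.25, 4.55, 6.5)
SIGMA_SCALE = (2.0, 1.0, 1.0, 1.4)


def gaussian_cdf(x, mean, std):
    return 0.5 * math.erfc(-(x - mean) / (std * math.sqrt(2.0)))


def lsb_transition_probs(q3, q4, sigma):
    """P(Y | X) for Y in (0, e, 1); X=0 is states {0, 1}, X=1 is states {2, 3}."""
    rows = []
    for states in ((0, 1), (2, 3)):
        below_q3 = sum(gaussian_cdf(q3, MEANS[k], SIGMA_SCALE[k] * sigma) for k in states) / 2.0
        below_q4 = sum(gaussian_cdf(q4, MEANS[k], SIGMA_SCALE[k] * sigma) for k in states) / 2.0
        rows.append((below_q3, below_q4 - below_q3, 1.0 - below_q4))
    return rows


def mutual_information(rows):
    """I(X; Y) in bits for equiprobable X and the given transition rows."""
    p_y = [0.5 * (p0 + p1) for p0, p1 in zip(*rows)]
    mi = 0.0
    for row in rows:
        for p_yx, py in zip(row, p_y):
            if p_yx > 0.0:
                mi += 0.5 * p_yx * math.log2(p_yx / py)
    return mi


def best_lsb_boundaries(sigma, lo=3.25, hi=4.55, steps=53):
    """Grid search for (q3, q4), q3 <= q4, maximizing the LSB mutual information."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    best = (-1.0, None, None)
    for i, q3 in enumerate(grid):
        for q4 in grid[i:]:
            mi = mutual_information(lsb_transition_probs(q3, q4, sigma))
            if mi > best[0]:
                best = (mi, q3, q4)
    return best


print(best_lsb_boundaries(0.4))
```

Because the grid includes the degenerate case $q_3 = q_4$ (a single hard read), the maximum found is never worse than the best single-boundary choice on the same grid.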
Practical SMMI boundary calculation
In the MMI example shown above, a 4 input, 7 output MLC model shown in Figure 10 was adopted for illustration purposes. However, there are at least 3 sensing operations in 1 overlapped region in a practical control system as demonstrated in Figure 11 , where the intersections of two distributions in the middle are called "hard-decision boundaries" and {q i , i = 1, 2, . . . , 6} mentioned before are called "soft-decision boundaries". Channel models for the LSB and the MSB in this scheme are shown in Figures 12 and 13 .
The mutual information for the LSB in this case is calculated as 
and the mutual information for the MSB is calculated as
Figure 12 Channel models for the LSB in practical scheme.
Figure 13 Channel models for the MSB in practical scheme.
Figure 14 (Color online) Detection in binary-input decoder.
LLR calculation
According to (9) , quantized LLRs are calculated as follows:
$L_i^{\mathrm{LSB}}$ and $L_i^{\mathrm{MSB}}$ designate the LLRs of the LSB and the MSB of the quantization region $R_i$. We take the LSB as an example to further explain (17).
Under the Gray mapping scheme in Subsection 3.1 (illustrated in Figure 14), $p^{(2)}(x)$ and $p^{(3)}(x)$ are the 2 distributions where LSB = 1, whereas $p^{(0)}(x)$ and $p^{(1)}(x)$ are the distributions where LSB = 0. Under this condition, the numerator in (17), which contains the integral of $p^{(2)}(x) + p^{(3)}(x)$ with respect to $x$ over the interval $R_i$, represents the probability that LSB = 1. Likewise, the denominator is the probability that LSB = 0.
Under Gaussian estimation, Q-function can easily calculate desired LLRs as
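The region-wise LLR can be sketched with the Q-function as follows. The code assumes the Gaussian parameters of Section 5, the Gray order 00, 10, 11, 01, and a hypothetical set of six soft-decision boundaries; the probability mass of each state in a region is a difference of two Q-function values, evaluated on the side of the mean that avoids cancellation. All names and the example boundary values are illustrative.

```python
import math

MEANS = (0.0, 3.25, 4.55, 6.5)
SIGMA_SCALE = (2.0, 1.0, 1.0, 1.4)
ONES = {"MSB": (1, 2), "LSB": (2, 3)}  # Gray order 00, 10, 11, 01


def q_function(x):
    if x == math.inf:
        return 0.0
    if x == -math.inf:
        return 1.0
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def region_mass(lo, hi, mean, std):
    """P(lo < V <= hi) for a Gaussian, written with Q-function values only."""
    z_lo, z_hi = (lo - mean) / std, (hi - mean) / std
    if z_lo >= 0:
        return q_function(z_lo) - q_function(z_hi)   # region above the mean
    return q_function(-z_hi) - q_function(-z_lo)     # lower-tail form


def quantized_llrs(boundaries, sigma, bit):
    """One LLR per quantization region, with the bit = 1 mass in the numerator."""
    edges = [-math.inf] + sorted(boundaries) + [math.inf]
    llrs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mass = [region_mass(lo, hi, m, s * sigma) for m, s in zip(MEANS, SIGMA_SCALE)]
        ones = sum(mass[k] for k in ONES[bit])
        zeros = sum(mass[k] for k in range(4) if k not in ONES[bit])
        llrs.append(math.log(ones / zeros))
    return llrs


# Hypothetical boundaries q1..q6 (volts), giving 7 regions.
example_boundaries = [1.6, 3.6, 3.9, 4.2, 5.2, 5.6]
print(quantized_llrs(example_boundaries, 0.4, "LSB"))
```

Since the boundaries are fixed for a given wear level, these LLRs can be precomputed and stored in a small lookup table indexed by the region returned by the detector.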
Binary-input decoder
A sensing strategy is shown in Figure 14. Three reference voltages, denoted by $V_0$, $V_1$, and $V_2$, separate the 4 voltage distributions. The detector first compares the current voltage with $V_1$ to decide the LSB and then with $V_0$ or $V_2$ to decide the MSB. A detailed description can be found in [28]. According to (4), $\hat{u}$ is judged by the sign bit of the LLR. Therefore, hard results can be fully utilized since they can represent the sign bit of the LLR. In other words, they can be transformed into a special form of quantized LLR consisting of only a sign bit, which is why the decoder is called a "binary-input decoder".
Magnitude of LLR is not concerned in this scenario and only sign bits will participate in the subsequent calculation, which makes it possible to apply simple bit operations in hardware without adder-subtractors in traditional processing element (PE) design [18] . This design is hardware-friendly and will be further discussed in Section 4. 
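The two-step sensing strategy and the sign-only LLR can be sketched as below. The code assumes the Gray order 00, 10, 11, 01 from the lowest to the highest state and maps bit 0 to +1 and bit 1 to −1; this sign convention and the example reference voltages are assumptions for illustration.

```python
def detect_bits(voltage, v0, v1, v2):
    """Two comparisons per cell: V1 decides the LSB, then V0 or V2 the MSB.

    Returns (msb, lsb) under the Gray order 00, 10, 11, 01.
    """
    if voltage < v1:
        lsb = 0
        msb = 1 if voltage >= v0 else 0   # second read against V0
    else:
        lsb = 1
        msb = 1 if voltage < v2 else 0    # second read against V2
    return msb, lsb


def sign_llr(bit):
    """Sign-only LLR: +1 for bit 0 and -1 for bit 1 (assumed convention)."""
    return 1 - 2 * bit


# Hypothetical reference voltages between the four states.
V0, V1, V2 = 1.6, 3.9, 5.4
for v in (0.0, 3.25, 4.55, 6.5):
    msb, lsb = detect_bits(v, V0, V1, V2)
    print(v, (msb, lsb), (sign_llr(msb), sign_llr(lsb)))
```

Only two reads are needed per cell because the first comparison selects which of the two remaining references is relevant.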
4 Architecture of proposed binary-input decoders
Two's complement analysis
According to (1), Type I PE will result in 0 if LLRs are quantized to ±1. In other words, data transferred between entities in different levels are not completely in binary form and hence cannot be represented by a single bit. Therefore, 2-bit 2's complement is adopted for simplicity of logical functions and demand of indicating 3 possible LLRs {0, ±1}.
Input and output analysis

Type I PE
According to [18], the universal Type I PE based on the min-sum SC algorithm is a series of half or full adder-subtractors. Calculation of LLRs in (1) is significantly simplified under 2-bit quantization. Unlike the universal Type I PE calculation with arbitrary inputs, the binary PE has a limited input set $I = \{-1, 0, +1\}$ which exhaustively lists all possible results. Suppose $X$ and $Y$ are two 2-bit operands, $u$ is the last decoded bit which chooses the calculation pattern, and $Z$ is the output. The mathematical function of Type I PE is
Note that the results of (19) may be ±2 and will be quantized to ±1 for simplicity of calculation. Therefore, all the possible results are listed in Table 1 and we can directly focus on the input and output by transforming Table 1 into 2's complement as shown in Table 2 . In particular, we can separate the MSB and the LSB of output Z and treat this PE as a combinational logic circuit with a 5-bit input (u, X M , X L , Y M , Y L ) and a 2-bit output (Z M , Z L ). Therefore, Table 2 is the truth table for this logic circuit which enables us to simply build corresponding logic functions.
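The truth table described here can be generated mechanically. The sketch below assumes that the Type I PE computes $(1 - 2u)X + Y$, which is the usual signed-sum form, saturates ±2 to ±1 as stated above, and encodes values in 2-bit 2's complement (+1 as 01, 0 as 00, −1 as 11). The function names are illustrative, and the listing is not a copy of Tables 1 and 2.

```python
# 2-bit two's complement encodings as (MSB, LSB).
ENCODE = {-1: (1, 1), 0: (0, 0), 1: (0, 1)}


def type1_pe(x, y, u):
    """Signed sum steered by u, saturated to the set {-1, 0, +1}."""
    z = (1 - 2 * u) * x + y
    return max(-1, min(1, z))


def truth_table():
    """Rows of (u, X_M, X_L, Y_M, Y_L, Z_M, Z_L) for all valid inputs."""
    rows = []
    for u in (0, 1):
        for x in (-1, 0, 1):
            for y in (-1, 0, 1):
                z = type1_pe(x, y, u)
                rows.append((u, *ENCODE[x], *ENCODE[y], *ENCODE[z]))
    return rows


for row in truth_table():
    print(row)
```

The input pattern 10 (that is, −2) never occurs, so those rows can be treated as don't-care terms when the logic functions are minimized.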
Type II PE
The architecture of Type II PE is more straightforward. With binary input, Eq. (2) can be pruned to without obtaining the minimum of 2 inputs since their absolute values have already been quantized to 1.
Considering the property of multiplication, the output will be 0 once there exists a 0 in 2 inputs. Therefore, hardware architecture design can be simplified by independently considering inputs ±1. Note that both 2's complements of ±1 have the same LSB as 1 and the outputs can only be ±1, which means the LSB will constantly be 1. Therefore, we can extract the MSB to analyze the input and output (I/O). I/O analysis and the corresponding 2's complements have been shown in Tables 3 and 4 by adopting the method mentioned in Subsection 4.2.1.
We can conclude from Table 4 that the calculation of the MSB of Type II PE using 2's complement is equivalent to an XOR operation. Therefore, Type II PE can be pruned to an XOR operation for the MSB and a fixed 1 for the LSB.
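The XOR simplification can be verified exhaustively. The sketch below assumes the same 2-bit 2's complement encoding (+1 as 01, 0 as 00, −1 as 11) and compares the bit-level rule against the arithmetic product of the two inputs, which is what the sign-and-minimum rule reduces to when all magnitudes are 0 or 1. The function names are illustrative.

```python
ENCODE = {-1: (1, 1), 0: (0, 0), 1: (0, 1)}
DECODE = {bits: value for value, bits in ENCODE.items()}


def type2_pe_bits(x_bits, y_bits):
    """Bit-level Type II PE on (MSB, LSB) pairs."""
    (x_m, x_l), (y_m, y_l) = x_bits, y_bits
    if not (x_l and y_l):
        # A zero input has LSB 0, and then the output is 0.
        return (0, 0)
    return (x_m ^ y_m, 1)  # XOR on the MSBs, LSB fixed to 1


# Compare against the arithmetic product for all 9 input pairs.
for x in (-1, 0, 1):
    for y in (-1, 0, 1):
        z = DECODE[type2_pe_bits(ENCODE[x], ENCODE[y])]
        print(x, y, z, z == x * y)
```

The zero check uses only the two input LSBs, so in hardware it amounts to gating the output with their AND.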
Design of binary PEs
Design of binary Type I PE
Binary Type I PE can be treated as a combinational logic circuit based on the analysis in Table 2 .
In this part, variable settings in Subsection 4.2.1 are adopted and therefore X and Y are two binaryinput operands, the last-decoded bit u is a selection bit and the output is represented by Z. With 2-bit quantization for X, Y and Z, binary Type I PE consists of 5 inputs (u, X M , X L , Y M , Y L ) and 2 outputs (Z M , Z L ). The logical functions are listed as follows:
The gate-level circuit diagram of binary Type I PE is depicted in Figure 15 .
Design of binary Type II PE
The core of Type II PE design can be summarized in 3 key points based on the aforementioned I/O analysis.
(1) The MSB of Type II PE's output can be simply calculated by an XOR operation under 2-bit 2's complement;
(2) The LSB of Type II PE's output is fixed to 1;
(3) The output will be 0 once there exists a 0 in the inputs.
Architecture of binary Type II PE is shown in Figure 16.
Performance assessment
In this section, we provide the error performance of different codes and discuss their complexities.
Settings of simulation
We adopt an (8192, 7168) polar code using different inputs under MLC NAND flash memory channels. Besides, an (8192, 7168) QC-LDPC code using the bit-flipping decoding algorithm is also used for comparison. The selection of the information length is based on [31, 32]. We adopt a 2-bit/cell MLC NAND flash memory model [25] as the simulation environment. It is assumed that the mean value of the Gaussian distribution for the erase state, which represents 00, is 0 V and the target voltages of the programming states are 3.25, 4.55, and 6.5 V for symbols 10, 11, and 01, respectively. Standard deviations for the four states are set to 2σ, σ, σ, and 1.4σ, where σ changes over time due to multiple interferences. Hard-decision boundaries in the binary decoder are the 3 intersections between the 4 Gaussian distributions, and SMMI is applied to obtain the other soft-decision boundaries.
The binary-input decoder employs 2-bit quantized LLRs. Floating-point LLRs are used in quantized-soft decoders. The maximum number of iterations is set to 15 in hard-decision bit-flipping LDPC decoding.
Simulation
The result is based on FER versus raw error probability and the design of x-axis is explained as follows. The MLC flash memory is modeled as 4 Gaussian distributions and has 3 hard-decision boundaries. In hard decoding, a raw error happens once the voltage in a Gaussian distribution shifts to its neighbouring distributions (i.e., crosses the left or right hard-decision boundary). Under Gaussian distribution, the raw error probability P can be calculated by Q-function.
In Figure 17, the binary-input polar decoder clearly outperforms the hard-decision bit-flipping LDPC decoder. With more sensing operations, the quantized-soft polar decoder is capable of correcting more error bits than the binary-input polar decoder, which assures the data stability of the whole system.
Complexity analysis
Decoding of polar code
The complexity of full-size SC decoding is $N \log_2 N$, where $N$ is the code length [17]. For LLR-based min-sum SC decoding, the decoder complexity is
• Type I PEs: $(N \log_2 N)/2$ additions;
• Type II PEs: $(N \log_2 N)/2$ comparisons/selections (equivalent to additions) and $(N \log_2 N)/2$ sign-bit multiplications (equivalent to XOR).
Overall, the decoding complexity is $N \log_2 N$ additions (XOR is negligible compared with addition). For binary-input SC decoding, LLRs are quantized to ±1, which means the comparison in Type II PEs is no longer needed. Therefore, the overall decoding complexity is $(N \log_2 N)/2$ 2-bit additions and $(N \log_2 N)/2$ XOR operations.
Decoding of LDPC code
Among various LDPC decoding algorithms, the min-sum algorithm is the most widely used [8] [9] [10] [11]. In this subsection, we adopt the complexity analysis of LBP decoding with the min-sum algorithm in [33]. The code length and information length are represented by $N$ and $K$, and the column and row weights are denoted by $d_v$ and $d_c$.
For LBP decoding, the complexity in one iteration is
Overall, the decoding complexity is $(N - K)(2d_c + 1) + 2N d_v$ additions per iteration. According to [33], LBP decoding converges within 15 to 20 iterations (denoted by $I$) and the average column weight is $d_v = 3.9375$ (when the code rate $R = 0.75$). To this end, $\log_2 N$ is obviously smaller than $d_v I$ when $N$ is less than 8K byte in storage system. Therefore, the computational complexity of SC polar decoding is much lower than that of LDPC LBP decoding. The complexity of the standard min-sum algorithm is similar.
For hard-decision bit-flipping decoding, the complexity in one iteration is
• Syndrome calculation: $(N - K)d_c$ additions and multiplications in GF(2);
• Number of unsatisfied parity checks: $E d_c$ additions, where $E$ is the number of 1's in the syndrome;
• $(N - 1)$ comparisons (additions) to obtain the largest number of unsatisfied parity checks.
The complexity of bit-flipping decoding is mainly determined by the $(N - 1)$ comparisons; hence the overall decoding complexity is $I(N - 1)$ additions in the worst case. In [8], the number of iterations of the modified gradient descent bit-flipping (MGDBF) decoder is set to 30.
Comparison of decoding complexity
When setting the code length $N$ to 8192, the information length $K$ to 7168, the number of iterations $I$ to 20, the column weight $d_v$ to 4, and the row weight $d_c$ to 30, the decoding complexity is compared in Figure 18. It is obvious that the proposed binary-input SC decoder has the lowest complexity. Moreover, polar codes using the SC algorithm have much lower computational complexity than traditional LDPC codes using LBP decoding.
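The operation counts quoted in Subsections 5.3.1 and 5.3.2 can be evaluated directly for these parameters. The sketch below simply codes the four expressions from the text; the function names are illustrative, and the code length is assumed to be a power of two.

```python
def sc_minsum_additions(n):
    """Min-sum SC decoding: N * log2(N) additions (N a power of two)."""
    return n * (n.bit_length() - 1)


def binary_sc_ops(n):
    """Binary-input SC decoding: half 2-bit additions, half XOR operations."""
    half = n * (n.bit_length() - 1) // 2
    return {"2-bit additions": half, "XORs": half}


def lbp_additions(n, k, d_v, d_c, iterations):
    """LBP min-sum LDPC decoding: ((N-K)(2*d_c+1) + 2*N*d_v) per iteration."""
    return iterations * ((n - k) * (2 * d_c + 1) + 2 * n * d_v)


def bit_flipping_additions(n, iterations):
    """Hard-decision bit flipping, worst case: I * (N - 1) additions."""
    return iterations * (n - 1)


N, K, I, D_V, D_C = 8192, 7168, 20, 4, 30
print("SC min-sum:", sc_minsum_additions(N))
print("binary-input SC:", binary_sc_ops(N))
print("LDPC LBP:", lbp_additions(N, K, D_V, D_C, I))
print("LDPC bit flipping:", bit_flipping_additions(N, I))
```

For the stated parameters these expressions give 106496 additions for min-sum SC, 53248 2-bit additions plus 53248 XORs for binary-input SC, 2560000 additions for LBP, and 163820 additions for bit flipping, which matches the ordering claimed in the text.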
Conclusion
This paper demonstrates that the polar-coded scheme holds great promise for the data stability of MLC NAND flash memory. First, the proposed multi-strategy pre-check scheme balances error performance and decoding latency well. Second, the binary-input decoder is proposed to relieve the quantization burden of the quantized-soft decoder and to lower the computational complexity compared with LDPC codes. Third, a new method named SMMI is proposed to calculate quantization boundaries without boundary searching. Finally, Gray code has been proved to be the optimal mapping scheme in our system.
The four 2-bit states are denoted by A, B, C, and D, with $D = (0\ 1)^{T}$.
By making a full permutation of these states, we get 24 different schemes, as shown in Table A1. Each scheme has 2 rows, and if 1 bit differs from its neighbouring bit in a row, we call it a change. For example, the combination ABCD indicates the mapping scheme shown in Table A2. In this case, the number of changes is 3. The statistical results are shown in Table A1. We can conclude by enumeration that the number of changes is 3 if and only if the mapping scheme is a Gray code. All other alternatives have 4 or 5 changes.
We already know that raw errors usually happen in overlapped regions. For a 2-bit cell, there are 3 overlapped regions. Assume the raw error probabilities for these regions are $P_1$, $P_2$, and $P_3$, respectively. Therefore, the expectation of raw error bits under Gray mapping is
$$N_G = P_1 + P_2 + P_3, \tag{A1}$$
whereas the expectation for all other alternatives is
$$N_A = \alpha P_1 + \beta P_2 + \gamma P_3 \quad (\alpha + \beta + \gamma = 4 \text{ or } 5,\ \alpha\beta\gamma \neq 0). \tag{A2}$$
$N_A$ is strictly larger than $N_G$. In other words, Gray code is the best choice among the mapping schemes.
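The enumeration argument can be reproduced in a few lines. The sketch below represents the four states as 2-bit strings, walks through all 24 orderings, and counts the bit changes between neighbouring states; the symbol strings and function names are illustrative.

```python
from itertools import permutations

SYMBOLS = ("00", "01", "10", "11")


def changes(order):
    """Total number of bit changes between neighbouring states in an ordering."""
    return sum(a != b
               for left, right in zip(order, order[1:])
               for a, b in zip(left, right))


def tally():
    """Group the 24 orderings by their number of changes."""
    groups = {}
    for order in permutations(SYMBOLS):
        groups.setdefault(changes(order), []).append(order)
    return groups


groups = tally()
for count in sorted(groups):
    print(count, len(groups[count]))
```

The orderings with 3 changes are exactly those in which every pair of neighbours differs in a single bit, i.e., the Gray orderings, including the order 00, 10, 11, 01 used in the simulation settings.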
