The reliability of NAND flash memories has decreased due to continuous technology scaling and the use of the multilevel cell (MLC) approach. Typically, storage device manufacturers specify an uncorrectable bit error rate of $10^{-13}$ to $10^{-16}$ [1]. Owing to their low hardware complexity, hard-decision error-correcting codes (ECCs) have been widely used in NAND flash memory devices. However, as the raw bit error rate (RBER) worsens, more powerful ECCs are required. ECCs with soft-decision decoding algorithms show better error-correcting performance than hard-decision codes [2]. Among the soft-decision ECCs, low-density parity-check (LDPC) codes provide very good error-correcting performance. Recently, many researchers have used LDPC codes to address error correction in NAND flash memory [2], [3]. However, to adapt LDPC codes for storage device applications, it is necessary to evaluate their error-correcting performance at very low frame error rate (FER). This evaluation requires the acceleration of two key functions: 1) the decoding algorithm; and 2) a proper channel model. In this article, we present an FPGA-based accelerator dealing with both functions: the decoding part supports very high rate and large block length LDPC codes [4], while the additive white Gaussian noise (AWGN) model has been adopted as a good, low-complexity approximation of the channel model.
Editor's notes: In this article, the authors implement an FPGA simulator that accelerates the performance evaluation of very long QC-LDPC codes, and present a novel 8-KB LDPC code for NAND flash memory with better performance.
-Prof. Qiang Xu, The Chinese University of Hong Kong
Currently, NAND flash memories use page sizes of 4 and 8 KB. These page sizes are expected to increase in the next few years, which makes both the design and the implementation of long LDPC codes with high code rate difficult. Kou et al. [5] and Li et al. [6] have given systematic algebraic constructions of LDPC codes. These LDPC codes have high code rate and good error-correcting and error-floor performance. However, long LDPC codes require a large amount of FPGA resources for high-throughput implementation. Moreover, the Euclidean geometry (EG) LDPC codes presented in [5] require a complex switching network in the decoder. Despite the large number of generalized FPGA-based implementations of quasi-cyclic LDPC (QC-LDPC) codes in the open literature, none of them deals with such QC-LDPC codes. Moreover, to the best of our knowledge, error performance evaluation has not been reported for storage devices with such large page sizes. Zhong et al. [7] have given the error performance of randomly constructed QC-LDPC codes up to a maximum block length of 2 KB. In this work, we have constructed and evaluated the performance of regular algebraic QC-LDPC codes for the 8-KB page size of NAND flash memories. The contributions of this work are as follows:
• implementation of a generalized and high-throughput FPGA emulator for very high rate and large block length regular algebraic QC-LDPC codes;
• fewer hardware resources as compared to the decoder proposed in [4];
• use of high-level synthesis to implement a high-quality AWGN channel;
• construction of a novel algebraic QC-LDPC code for 8-KB NAND flash memory. The proposed code has lower hardware complexity and improved performance as compared with the previously proposed EG-LDPC code [3].
LDPC decoding and hardware acceleration
LDPC codes are characterized by a binary parity check matrix $H$ with $M$ rows and $N$ columns. The $H$ matrix is sparse, and valid codewords $x$ satisfy $H \cdot x' = 0$, where $x'$ is the transpose of $x$.
In the Tanner graph terminology, columns of $H$ (associated with bits of $x$) correspond to variable nodes, and rows (associated with parity equations) correspond to check nodes. Degrees of check and variable nodes are equal to the numbers of ones along rows and columns of $H$, respectively. In structured LDPC codes, $H$ is organized with partially regular submatrices to simplify encoding and decoding procedures. In particular, QC-LDPC codes are a well-known class of structured codes, where $H$ can be represented as an $M_b \times N_b$ array of $z \times z$ circulant permutation submatrices, with $M = M_b \cdot z$ and $N = N_b \cdot z$.
Each submatrix is either a zero matrix or the superposition of $w$ cyclic-shifted identity matrices ($w \ge 1$ is referred to as the circulant weight of the code).
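The block structure described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available; the helper names (`circulant`, `build_h`, `is_codeword`) and the toy shift table are hypothetical and do not correspond to any code evaluated in this article.

```python
import numpy as np

def circulant(z, shifts):
    """Return a z x z block: the GF(2) sum of identity matrices cyclically
    shifted by each value in `shifts` (an empty list gives the zero block)."""
    block = np.zeros((z, z), dtype=np.uint8)
    for s in shifts:
        block ^= np.roll(np.eye(z, dtype=np.uint8), s, axis=1)
    return block

def build_h(shift_table, z):
    """Expand an M_b x N_b table of shift lists into the (M_b*z) x (N_b*z) matrix H."""
    return np.block([[circulant(z, shifts) for shifts in row] for row in shift_table])

def is_codeword(H, x):
    """True when H * x' = 0 over GF(2)."""
    return not np.any((H.astype(int) @ x) % 2)

# Toy example (hypothetical shifts): M_b = 2, N_b = 3, z = 4, circulant weight w = 1.
shift_table = [[[0], [1], [2]],
               [[0], [2], []]]
H = build_h(shift_table, z=4)
print(H.shape)                                     # (M_b * z, N_b * z)
print(is_codeword(H, np.zeros(12, dtype=int)))     # the all-zero word always satisfies H * x' = 0
```

Storing only the shift values per block, rather than the full matrix, is what makes the structured representation compact for codes of the size considered here.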
LDPC decoding is usually handled via the belief propagation algorithm or one of its approximations [8] . The decoding process iteratively updates bit error probabilities [usually represented as logarithmic likelihood ratios (LLRs)], which express both the value of codeword bits (sign) and their reliability (magnitude). The decoding algorithm can be seen as the repetitive exchange of messages between variable and check nodes, which leads to the progressive refinement of LLRs toward correct decoded bits.
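As an illustration of the message exchange, the sketch below shows one check-node update under the normalized min-sum approximation, the algorithm used for the results later in this article. The function name, the message layout, and the treatment of the normalization factor `alpha` as a multiplicative scale are assumptions made for the example; it is not a description of the decoder in [4].

```python
import numpy as np

def check_node_update(vtc, alpha):
    """Normalized min-sum check-node update (software sketch).

    vtc   : variable-to-check LLR messages arriving at one check node
    alpha : normalization factor (assumed here to scale the magnitude)
    Each outgoing message carries the sign product and the minimum
    magnitude of all the *other* incoming messages, scaled by alpha.
    """
    vtc = np.asarray(vtc, dtype=float)
    signs = np.where(vtc < 0, -1.0, 1.0)
    mags = np.abs(vtc)
    total_sign = np.prod(signs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    # The edge holding the smallest magnitude receives the second smallest.
    out_mag = np.where(np.arange(len(vtc)) == order[0], min2, min1)
    # total_sign * signs[i] equals the sign product over the other edges.
    return alpha * total_sign * signs * out_mag

ctv = check_node_update([2.0, -3.0, 1.0, 4.0], alpha=0.375)
print(ctv)
```

Only the two smallest magnitudes and the overall sign need to be tracked per check node, which is why this approximation maps well onto hardware.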
FPGA-based LDPC emulators can be classified based on 1) the type of supported LDPC codes (structured, unstructured, or both); 2) the architecture of the decoder (serial or parallel); 3) the decoding algorithm; and 4) the target application. An FPGA emulator for structured LDPC codes is presented in [9]. The emulator is able to achieve an average throughput of 1.35 Gb/s for the (2048,1723) Reed-Solomon LDPC code using a single partially parallel core based on the normalized min-sum algorithm [8]. However, the throughput of the decoder decreases as the circulant size increases.
Zhong et al. [7] and Cai et al. [10] investigated the error-floor performance of LDPC codes for the magnetic recording channel. In [7], the authors implemented an FPGA-based simulator for hardware-aware and performance-oriented QC-LDPC codes. They randomly constructed high-rate QC-LDPC codes with maximum circulant and column weights of 2 and 4, respectively. The block lengths of the codes range from 4608 to 16384, with rates varying from 8/9 to 15/16. The maximum throughput is 360 Mb/s with iterative detection and decoding. In [10], a high-throughput emulator is designed by resorting to multicore processing. The system occupies three BEE2 boards, each containing five Xilinx Virtex-II Pro FPGAs, and implements 27 parallel LDPC decoding cores for a code of length 4923 and rate 8/9. The throughput of each core is 175 Mb/s, for a total throughput of 4.725 Gb/s. However, implementing such a high degree of parallelism is difficult and costly for large block length LDPC codes.
The emulators discussed above have mostly targeted structured LDPC codes, as they feature simple encoder and decoder architectures. Indeed, the circulant weight of structured LDPC codes is usually 1. For EG-LDPC codes [5], the circulant weight is rather high, and it is difficult to find a conflict-free memory mapping for these high-circulant-weight codes. Wang and Cui [11] have proposed a partially parallel decoder architecture for regular QC-LDPC codes. To manage high-circulant-weight matrices, they proposed to use one separate memory bank for each cyclically shifted identity matrix, with a switching network between the memories and the variable-node and check-node processing units. However, as the circulant weight becomes high, a large number of memory banks are required and the complexity of the switching network increases as well.
LDPC emulator
This section presents the proposed FPGA emulator for regular QC-LDPC codes. The emulator is implemented on a Xilinx VC709 FPGA board, containing a Virtex-7 XC7VX690T FPGA device. The emulator does not require the use of hardware-aware QC-LDPC codes and achieves a throughput of more than 1 Gb/s. Moreover, the emulator can also be used to evaluate the performance of very high circulant weight LDPC codes, such as EG-LDPC codes.
A block diagram of the complete emulator system is shown in Figure 1a. The hardware is controlled by GUI-based software running on a PC. The communication between the hardware and the PC relies on the RS-232 port. The MicroBlaze (μB) soft processor receives the configuration data and the control signals from the PC and sends them to the configuration and control registers block. Similarly, it also receives the data from the LDPC emulator hardware unit, through the configuration and control registers block, and sends them to the PC. The control signals, shown as dotted lines in Figure 1a, include the Start and Stop signals, for starting and stopping the simulation; the NextSNR signal, for moving to the next signal-to-noise ratio (SNR) point; the Resume signal, for resuming the simulation for a given SNR point; and the Done_config signal, which is asserted when the configuration of the hardware is completed for the current SNR point. The Read_data signal is driven by the LDPC Emulator Hardware module for reading the number of wrong frames, the total iterations of the decoder, and the current seed values of the noise and the source bit generators. The Read_data signal is activated after a specified number of frames is completed or the maximum number of wrong frames is reached.
The main blocks of the LDPC emulator hardware are shown in Figure 1b. It consists of the LLR generator, the LDPC decoder, the frame error counter, the total iterations and total frames counters (Frames_Iters cnt), and a control unit. The configuration data for the LLR generator include: 1) the initial values of the seed for the noise samples and the source bit generators; 2) the number of integer and fractional bits for the LLRs of the channel symbols; and 3) the $1/\sigma$ and $2/\sigma$ values, where $\sigma$ is the standard deviation of the noise at a given SNR point. Similarly, the maximum number of iterations and the number of frames after which the value of total iterations is read are applied to the LDPC decoder and Frames_Iters cnt modules, respectively. The total_frames output is read when the maximum number of wrong frames is reached. The decoder asserts the Done_dec signal after the decoding of each frame is completed and provides 1) the number of iterations (No_iters) used to complete the decoding; and 2) the output of the parity check circuit, which is high when the frame is not decoded correctly. These data are added to the current values of the total iterations and the wrong frames, respectively, by means of two counters.
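The counting and read-out behavior described above can be mimicked in software as follows. This is a hedged sketch only: `run_snr_point`, `stub_decoder`, and the tuple layout of a read-out are hypothetical names chosen for the example, and the stub stands in for the real decoder's (No_iters, parity check) outputs.

```python
def run_snr_point(decode_frame, frames_per_read, max_wrong_frames):
    """Accumulate frame, iteration and wrong-frame counts for one SNR point.

    decode_frame() is assumed to return (no_iters, parity_fail), mirroring the
    two values the decoder provides when it asserts Done_dec.
    """
    total_frames = total_iters = wrong_frames = 0
    reads = []  # snapshots taken whenever a read-out would be triggered
    while wrong_frames < max_wrong_frames:
        no_iters, parity_fail = decode_frame()
        total_frames += 1
        total_iters += no_iters
        wrong_frames += int(parity_fail)
        # Read out after a specified number of frames or at the wrong-frame limit.
        if total_frames % frames_per_read == 0 or wrong_frames == max_wrong_frames:
            reads.append((total_frames, total_iters, wrong_frames))
    return total_frames, total_iters, wrong_frames, reads

def stub_decoder():
    """Hypothetical stand-in: every fifth frame fails after 8 iterations."""
    count = 0
    def decode_frame():
        nonlocal count
        count += 1
        return (8, True) if count % 5 == 0 else (3, False)
    return decode_frame

total_frames, total_iters, wrong_frames, reads = run_snr_point(
    stub_decoder(), frames_per_read=4, max_wrong_frames=2)
fer = wrong_frames / total_frames        # frame error rate estimate
avg_iters = total_iters / total_frames   # average iterations per frame
```

The FER estimate and the average iteration count follow directly from the three counters, which is why only those values need to be sent back to the PC.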
LLR generator
The LLR generator hardware, shown in Figure 2, produces the source bits, encodes the information frame, adds the Gaussian noise to model the AWGN channel, and generates the LLRs of the received bits. These LLRs are transferred to the decoder. As detailed in Figure 2, source bit generation is implemented with $P_l$ source generators (SGs), SG$_1$ to SG$_{P_l}$, where $P_l$ is the number of LLRs generated in parallel by the LLR generator. For simplicity, the architecture of the LLR generator is shown for the case $P_l = N_b$.
The first $P_l - 1$ SGs generate $z$ bits each, whereas the last SG generates $ldr$ bits, where $ldr$ is the number of redundant rows (linearly dependent rows) in the parity check matrix. These information bits are stored in the corresponding shift registers attached to the SGs.
After generating the information frame, the whole frame is encoded. The encoder takes $z$ clock cycles to encode one frame and produces $M_b$ parity bits per clock cycle. These $M_b$ parity bits correspond to the last $M_b$ submatrices of the $H$ matrix. After encoding the frame, the codeword bits $c_n$ are modulated: as an example, for the case of a single-bit memory cell, simple two-level amplitude modulation is used and $mb_n = (-1)^{c_n}$. The LLR for each codeword bit $c_n$ is calculated as follows:
For hardware implementation purpose, (1) is modified as follows:
Thus, if $1/\sigma$ is precomputed, then (2) needs a single multiplication. The single-precision floating-point AWGN generator proposed in [12] is used to produce high-quality noise samples. The AWGN generator is based on the Box-Muller algorithm and is implemented using the Xilinx high-level synthesis tool Vivado HLS 2014.2. The HLS tool provides the flexibility of specifying the throughput of the design; therefore, high-speed implementations can be obtained. The high-speed implementation and the high quality of the noise samples are very important for accurately measuring the BER or FER, especially at high SNR. The Flt_to_Fxd Converter module represents the LLRs as $Q$-bit fixed-point values, where Int_Bits and Frac_Bits are the numbers of integer (excluding the sign) and fractional bits, respectively. The vld signal (as shown in Figure 2) is asserted whenever there are valid LLR values at the output of the Flt_to_Fxd Converter module. There are 31 pipeline stages in the AWGN generator, seven pipeline stages in the floating-point adder and multiplier, and 17 pipeline stages in the Flt_to_Fxd Converter. Therefore, after the Start_awgn signal is asserted, a valid LLR value appears at the input of the decoder after 55 clock cycles. The architecture is pipelined: the SGs generate the source bits of the next frame while the LLRs are transferred to the decoder.
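A software sketch of the LLR path may help here. It assumes the standard LLR expression for two-level modulation over an AWGN channel, $2y_n/\sigma^2$ with $y_n = mb_n + \sigma w_n$ and $w_n$ a unit-variance Gaussian sample, rewritten as $(2/\sigma)(mb_n/\sigma + w_n)$ so that, with $1/\sigma$ and $2/\sigma$ precomputed, a single multiplication remains. The symbols $y_n$ and $w_n$, the function names, and the symmetric saturation in the fixed-point step are assumptions of this sketch; the hardware uses the generator of [12], not Python's `random` module.

```python
import math
import random

# Pipeline depths quoted in the text; their sum is the 55-cycle start-up latency.
AWGN_STAGES, FP_STAGES, CONVERTER_STAGES = 31, 7, 17

def box_muller(rng):
    """One unit-variance Gaussian sample from two uniform samples (Box-Muller)."""
    u1 = 1.0 - rng.random()          # in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def llr_from_bit(c, w, inv_sigma, two_inv_sigma):
    """Assumed form: LLR = (2/sigma) * (mb/sigma + w), with mb = (-1)**c."""
    offset = inv_sigma if c == 0 else -inv_sigma   # sign selection, no multiplier
    return two_inv_sigma * (offset + w)            # the single multiplication

def to_fixed(value, int_bits, frac_bits):
    """Quantize to a signed code with int_bits integer and frac_bits fractional bits.
    Symmetric saturation is an assumption of this sketch."""
    max_code = (1 << (int_bits + frac_bits)) - 1
    code = int(round(value * (1 << frac_bits)))
    return max(-max_code, min(max_code, code))

rng = random.Random(2014)            # arbitrary seed for the example
sigma = 0.5                          # arbitrary noise standard deviation
llr = llr_from_bit(0, box_muller(rng), 1.0 / sigma, 2.0 / sigma)
q = to_fixed(llr, int_bits=4, frac_bits=2)   # 7 bits in total, including the sign
```

Selecting $\pm 1/\sigma$ by the bit value instead of multiplying is what leaves one multiplier per LLR lane.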
Decoder
The partially parallel decoder architecture for regular QC-LDPC codes presented in [4] has been reused to implement the decoder core. The initial architecture was conceived to support generic EG-LDPC codes with high code rate and circulant weight. Key architectural features include compile-time flexibility with respect to the size and rate of the selected code, a high level of parallelism both at the decoder level and inside the check node unit, and support for layered decoding scheduling. Two important changes have been introduced in the decoder architecture. 1) To save hardware resources, right shifting in the LLR memory due to pipeline stages in the decoder is avoided by connecting each processing element to the LLR memory through multiplexers. This reduces the FPGA resources by 50% compared with the decoder in [4]. 2) Decoding performance has been enhanced by adopting the conditional variable-to-check node updating rule proposed in [8], which avoids the performance degradation
where $VTC_{ij}^{(k)}$ is the variable-to-check (VTC) message in the $k$th iteration from the $j$th variable node to the $i$th check node, with $1 \le i \le M$ and $1 \le j \le N$.
Results
In this section, FPGA implementation and simulation results of the two algebraic QC-LDPC codes developed for the 8-KB page size of NAND flash memory with 3450 (5%) spare bits are discussed.
The first code used is the rate-0.961 (69615, 66897) EG-LDPC code [3]. The $4095 \times 69615$ parity check matrix of this code consists of a $1 \times 17$ array of $4095 \times 4095$ submatrices. The row and column weights are 272 and 16, respectively. There are 1377 linearly dependent rows in the matrix. To adapt this code to the 8-KB page size of NAND flash memory, 1361 zero bits are inserted at the beginning of the information bits; the code therefore becomes a (68254, 65536) shortened EG-LDPC code [3]. For the shortening process, the first 1361 locations of the LLR memory on the decoder side are initialized with the maximum LLR value.
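The parameter arithmetic and the shortening step above can be checked with a short sketch. The function names are hypothetical, and the use of a positive maximum LLR for the known zero bits assumes the convention that positive LLRs favor bit 0.

```python
def qc_params(m_b, n_b, z, ldr):
    """Block length N and dimension K of a QC code whose parity check matrix is an
    m_b x n_b array of z x z blocks with ldr linearly dependent rows."""
    n = n_b * z
    k = n - (m_b * z - ldr)
    return n, k

def shorten(n, k, page_bits):
    """Shorten an (n, k) code so that exactly page_bits information bits remain."""
    pad = k - page_bits          # number of known zero bits inserted
    return n - pad, k - pad, pad

def init_llr_memory(channel_llrs, pad, max_llr):
    """Known zero bits get the maximum LLR (assumed sign convention: positive = bit 0)."""
    return [max_llr] * pad + list(channel_llrs)

PAGE_BITS = 8 * 1024 * 8                       # 8-KB page
n, k = qc_params(m_b=1, n_b=17, z=4095, ldr=1377)
n_s, k_s, pad = shorten(n, k, PAGE_BITS)
print((n, k), (n_s, k_s), pad)                 # (69615, 66897) (68254, 65536) 1361
```

Because the padded bits are known to be zero, they never need to be stored in the flash page; only the decoder's LLR memory has to account for them.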
The second code is the rate-0.96 (68544, 65861) algebraic QC-LDPC code, based on the construction method proposed in [6]. We chose the Galois field GF(449) and took two subsets of elements from this field, $S_1$ and $S_2$, expressed in terms of the primitive element of the field. The multiplication factor [6] is taken as 1. Based on these two subsets and the multiplication factor, we constructed a $2688 \times 68544$ parity check matrix that consists of a $6 \times 153$ array of $448 \times 448$ submatrices. Each submatrix is a cyclically shifted identity matrix. There are five linearly dependent rows in the matrix. The row and column weights are 153 and 6, respectively. For this matrix, we obtain a (68219, 65536) shortened code by inserting 325 zero bits into the information bits. The shortening process is the same as for the (69615, 66897) code.
We used the Xilinx Vivado 2014.2 tool for the whole design flow, including simulation, synthesis, mapping, and place and route of the complete system for the two codes. We targeted the Xilinx XC7VX690T FPGA device for the implementation of the system. For the (69615, 66897) EG-LDPC code, we selected a parallelism factor of 7 for the decoder; therefore, the number of rows processed by a single check node unit of the decoder is $N_r = 585$. We used 7 bits for the representation of LLRs, 2 bits of which are used for the fractional part. The resources consumed by the different modules and the overall resources of the hardware are shown in Table 1. The floating-point AWGN channel unit consumes only 1% of the slice LUTs of the FPGA, while consuming more of the dedicated DSP slices. The maximum clock frequency is 100 MHz (9.9 ns). The parallelism factor of the LLR generator module, as mentioned in the LDPC emulator section, is taken as 17, i.e., $P_l = N_b$; therefore, 17 LLRs are generated in parallel and transferred to the LLR memory.
As a consequence, it takes $s = 4095 + 55$ clock cycles to transfer all the LLRs of the channel to the decoder, where 4095 is the circulant size and the 55 additional clock cycles are due to the pipeline stages in the LLR generator module. The throughput per iteration of the system for this code is also given in Table 1 and can be calculated as
where $f_{clk}$ is the clock frequency, $s$ is the LLR transfer latency, and $no\_iters$ is the number of iterations.
For the second (68544, 65861) QC-LDPC code, we took a parallelism factor of 6 for the decoder, with $N_r = 448$. We used 7 bits for the LLRs, with no bits for the fractional part. The hardware resources for this code are summarized in Table 1. The number of clock cycles required to transfer the LLRs in this case is $s = 448 \times \lceil 153/17 \rceil + 55$, where the same parallelism factor of 17 is taken for the LLR generator. The throughput per iteration for this code is also given in Table 1. As can be seen from the table, this code achieves a higher throughput than the EG-LDPC code because its number of check nodes (2688) is only about 65% of that of the EG code (4095). Moreover, the degree of its check node unit is almost half that of the EG code; therefore, its decoder consumes fewer resources than the decoder of the EG code. The encoder of this code consumes twice the resources of the encoder of the EG code. This is because all the $M_b$ parity bits ($M_b = 6$ for the second code) are generated in parallel and the encoding process is completed in 448 clock cycles. The resources can be reduced by generating one parity bit per clock cycle. In this case, the encoding takes 2688 clock cycles, which is still faster than the first code, which needs 4095 clock cycles to generate all parity bits.
Figure 3 shows the comparison of the FER performance of both codes. The normalization factor in the normalized min-sum decoding algorithm used for the EG-LDPC code is 0.25, whereas that used for the second code is 0.375. The maximum number of iterations of the decoder is set to 8 for both codes. The normalization factor is set based on the simulation results obtained from a software model written in C.
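The two transfer latencies quoted above follow from the same expression, as the sketch below shows. The function `throughput_bps` is an assumption of this sketch rather than the article's formula: it models one frame as costing $s$ cycles plus a caller-supplied number of cycles per iteration, and the names are hypothetical.

```python
import math

PIPELINE_LATENCY = 55   # clock cycles contributed by the LLR generator pipeline

def llr_transfer_cycles(n_b, z, p_l):
    """Cycles needed to move all channel LLRs to the decoder when p_l LLRs are
    produced in parallel: z * ceil(n_b / p_l) plus the pipeline latency."""
    return z * math.ceil(n_b / p_l) + PIPELINE_LATENCY

def throughput_bps(n_bits, f_clk, s, no_iters, cycles_per_iter):
    """Assumed model: one frame takes s + no_iters * cycles_per_iter clock cycles."""
    return n_bits * f_clk / (s + no_iters * cycles_per_iter)

s_eg = llr_transfer_cycles(n_b=17, z=4095, p_l=17)    # 4095 + 55
s_qc = llr_transfer_cycles(n_b=153, z=448, p_l=17)    # 448 * 9 + 55
print(s_eg, s_qc)
```

Raising `p_l` lowers the ceiling term only in steps, so the latency changes when `ceil(n_b / p_l)` drops to the next integer.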
One hundred wrong frames are observed at each SNR point, except at a FER of $10^{-9}$ and $10^{-10}$, where at least ten and four wrong frames are observed, respectively. The average numbers of iterations for the first and second codes at higher SNR are observed to be 3.2 and 4.23, respectively. Therefore, average throughputs of 1.15 and 1.13 Gb/s are achieved for these codes. The FER of $10^{-9}$, which requires the simulation of at least $10^{10}$ frames, is reached in seven days. Neither code shows an error floor down to a FER of $10^{-9}$. However, Figure 3 shows that the proposed code outperforms the EG code by 0.15 dB at a FER of $10^{-9}$. The software simulations show that the number of bit errors/block errors at different SNR points is also smaller for the proposed code than for the EG code. The proposed code also has lower memory requirements and reduced decoder complexity, which are important features for application to NAND flash memories. Table 2 compares the block lengths, code rates, and throughput of our work with state-of-the-art FPGA-based implementations. As can be seen from the table, the block length of our codes is very large compared with previously reported work. Moreover, to the best of our knowledge, the high-speed FPGA implementation of very high circulant-weight LDPC codes has not been addressed in the literature. The speed of our implementation is comparable to that of the fastest reported emulator (single-core version). This speed can be increased by increasing $P_l$. As an example, by taking $P_l = 22$ for the second code, which consumes only 71.26% of the FPGA LUTs, an average throughput of 1.35 Gb/s can be achieved. This increase in $P_l$ requires two more AWGN generators and five more SGs, adders, multipliers, and Flt_to_Fxd Converter modules. As these modules use more DSP slices and fewer logic resources of the FPGA, this increase in the parallelism factor results in only a slight increase in the percentage of FPGA LUTs used.
Moreover, the achievable throughput of all implementations reported in Table 2 scales linearly with the number of allocated cores.
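The seven-day run time quoted for the FER of $10^{-9}$ can be cross-checked with simple arithmetic. The sketch below assumes that the reported average throughput counts all 69615 codeword bits of the first code per frame; the function names are hypothetical.

```python
def frames_needed(target_fer, wrong_frames):
    """Frames to simulate in order to observe `wrong_frames` errors at `target_fer`."""
    return wrong_frames / target_fer

def simulation_days(frames, bits_per_frame, throughput_bps):
    """Wall-clock days needed to push `frames` frames through the emulator."""
    return frames * bits_per_frame / throughput_bps / 86400.0

frames = frames_needed(1e-9, 10)                      # about 1e10 frames
days = simulation_days(frames, 69615, 1.15e9)         # first code at 1.15 Gb/s
print(f"{frames:.3g} frames, {days:.1f} days")
```

Under these assumptions the estimate comes out at roughly seven days, in line with the reported figure.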
To address error correction for the 8-KB page size of NAND flash memories, we evaluated the performance of very high code rate (0.96) and large block length algebraic QC-LDPC codes through a generalized and high-throughput FPGA-based emulator system. We used two codes: the (69615, 66897) EG-LDPC code and the (68544, 65861) algebraic QC-LDPC code. Simulation results on the AWGN channel show that these codes do not suffer from an error floor at a FER of $10^{-9}$. Moreover, the proposed (68544, 65861) algebraic QC-LDPC code shows good error performance and reduced hardware complexity as compared to the EG-LDPC code.
References
