Abstract: Quasi-cyclic (QC) low-density parity-check (LDPC) codes are well known for their excellent error-correction performance and hardware-friendly structure in NAND flash memory applications. Array LDPC codes are a highly structured type of QC-LDPC code that provides a good balance between performance and complexity. In this paper, we propose a method for constructing an (18900, 17010) LDPC code based on the Latin square and an improved array dispersion strategy that achieves multi-column alignment of the structure. Compared with the traditional design, the parallel hardware architecture reduces the number of barrel shifters by 32%. The corresponding ASIC implementation results show that the throughput of the proposed QC-LDPC code reaches up to 3.49 Gb/s and that the throughput-to-area ratio (TAR) of the proposed codes is significantly improved.
Introduction
NAND flash memories are widely used for storage in mobile devices. Owing to the growing demand for high-density storage capacity and throughput, multi-level cell (MLC) and triple-level cell (TLC) [1, 2] techniques are used in flash memory. However, data reliability is reduced by the higher raw bit error rate (RBER) that comes with increasing bit density [3].
Error correction codes (ECC), such as BCH codes and LDPC codes, are an efficient approach to guaranteeing data reliability. Compared with BCH codes, LDPC codes yield superior error correction performance with parity bits of the same size [4, 5]. QC-LDPC codes are known for their hardware-friendly structure and excellent error correction performance. Several studies have addressed the implementation of QC-LDPC codes [6, 7, 8, 9], including improvements to the decoding algorithm and optimization of the hardware architecture. Few studies, however, have focused on increasing the throughput of the decoding architecture. In this paper, we aim to develop a parallel decoding hardware architecture that meets the throughput requirements of both the Toggle DDR 2.0 and the ONFI 3.0 NAND interfaces [10].
The remainder of this paper is organized as follows. Section 2 introduces the column-based shuffle decoding algorithm, along with the challenges to parallel implementation. We also propose here the construction of a (18900, 17010) QC-LDPC code. In Section 3, the overall parallel architecture and the results of its implementation are detailed. Finally, we offer our conclusions in Section 4.
The check-node-to-variable-node (C2V) and variable-node-to-check-node (V2C) messages in the $w$th iteration are denoted by $L^{w}_{cv}$ and $L^{w}_{vc}$, respectively, where $c = 0, 1, \ldots, M-1$ and $v = 0, 1, \ldots, N-1$. Note that $N(c)$ is the set of VNs connected to CN $c$ and $M(v)$ the set of CNs connected to VN $v$. The VNs are divided equally into $G$ groups, and each group $N_g$ contains $N/G$ VNs, where $g = 0, 1, \ldots, G-1$. The following shows the message update between the CNs and VNs in the $g$th group in the $w$th iteration.
1. VNU operation: The V2C message $L^{w}_{vc}$ is updated as in Eq. (1).
2. CNU operation: The C2V message $L^{w}_{cv}$ is updated as in Eq. (2), and $\alpha$ represents the scaling factor.
$N_g$ is the set of VNs located in the $g$th group for $g = 0, 1, \ldots, G-1$. $N_L$ is the union of $N_0, N_1, \ldots, N_{g-1}$, and $N_R$ is the union of $N_g, N_{g+1}, \ldots, N_{G-1}$.
The a posteriori probability $L_{app,v}$ is computed as in Eq. (3) and used for the hard decision.
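Eqs. (1)-(3) are not reproduced in this text, so the following Python sketch assumes the standard normalized min-sum forms of the VNU, CNU, and a posteriori updates. The function name `shuffled_min_sum`, the toy parity-check matrix, and the value 0.75 for the scaling factor are illustrative assumptions and are not taken from the paper. The sketch is only meant to show how processing the VN groups in order lets later groups use V2C messages that were already updated in the current iteration.

```python
def shuffled_min_sum(H, llr, num_groups, alpha=0.75, max_iter=20):
    """Group-shuffled normalized min-sum decoding (illustrative sketch).

    Assumes every check has degree >= 2 and len(llr) is divisible by num_groups.
    Returns (hard_decisions, iterations_used).
    """
    m, n = len(H), len(H[0])
    rows = [[v for v in range(n) if H[c][v]] for c in range(m)]  # N(c)
    cols = [[c for c in range(m) if H[c][v]] for v in range(n)]  # M(v)
    # V2C messages start from the channel LLRs and are overwritten in place,
    # so groups processed later see values from the current iteration.
    v2c = {(v, c): llr[v] for v in range(n) for c in cols[v]}
    group_size = n // num_groups
    hard = [0] * n
    for it in range(1, max_iter + 1):
        for g in range(num_groups):
            for v in range(g * group_size, (g + 1) * group_size):
                c2v = {}
                for c in cols[v]:
                    # CNU: scaled minimum magnitude and sign product over N(c) \ v
                    others = [v2c[(u, c)] for u in rows[c] if u != v]
                    sign = -1.0 if sum(x < 0 for x in others) % 2 else 1.0
                    c2v[c] = alpha * sign * min(abs(x) for x in others)
                # A posteriori value and hard decision for VN v
                app = llr[v] + sum(c2v.values())
                hard[v] = 1 if app < 0 else 0
                # VNU: extrinsic V2C message excludes the target check's own C2V
                for c in cols[v]:
                    v2c[(v, c)] = app - c2v[c]
        # Early termination once all parity checks are satisfied
        if all(sum(hard[v] for v in rows[c]) % 2 == 0 for c in range(m)):
            return hard, it
    return hard, max_iter


# Toy example (hypothetical 3 x 6 matrix, all-zero codeword, bit 0 received in error)
H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]
decoded, iterations = shuffled_min_sum(H, [-1.0, 2.0, 2.0, 2.0, 2.0, 2.0], num_groups=2)
print(decoded, iterations)
```

With `num_groups=1` the same routine processes all VNs in a single group, which makes it easy to compare different group counts on the same input.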
Construction of parallel dispersed array QC-LDPC
We adopt the Latin square algorithm to construct a base matrix $W_a$ of size $z \times z$, where $z = 2^q = 64$. The entries $A_{i,j}$ of $W_a$ are elements of GF($2^q$). For $0 \le i \le 2^q - 1$ and $0 \le j \le 2^q - 1$, $A_{i,j}$ represents a submatrix that is either an all-zero matrix or a cyclic shift of an identity matrix of size $2^q - 1$. We illustrate the construction of the code as follows.
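As a concrete illustration of the dispersion just described, the Python sketch below expands a shift value into a cyclically shifted identity matrix and tests a matrix of shift values for 4-cycles. It assumes that a nonzero field element is represented by an integer shift value and the zero element by `None`; the helper names `circulant` and `has_4_cycle` and the small example matrices are hypothetical. The 4-cycle test is the standard condition for matrices built from circulant permutation blocks, not a procedure stated in the paper.

```python
Z = 63  # circulant size 2**q - 1 with q = 6


def circulant(shift, z=Z):
    """Return a z x z identity matrix cyclically shifted by `shift` columns.

    `shift=None` stands for the zero field element and gives an all-zero block.
    """
    if shift is None:
        return [[0] * z for _ in range(z)]
    return [[1 if (r + shift) % z == c else 0 for c in range(z)] for r in range(z)]


def has_4_cycle(P, z=Z):
    """Check a matrix of shift values P (None = zero block) for 4-cycles.

    Two block rows and two block columns close a 4-cycle exactly when the
    alternating sum of their four shifts is 0 modulo z.
    """
    for i1 in range(len(P)):
        for i2 in range(i1 + 1, len(P)):
            for j1 in range(len(P[0])):
                for j2 in range(j1 + 1, len(P[0])):
                    e = (P[i1][j1], P[i1][j2], P[i2][j2], P[i2][j1])
                    if None in e:
                        continue  # a zero block cannot take part in the cycle
                    if (e[0] - e[1] + e[2] - e[3]) % z == 0:
                        return True
    return False


print(circulant(1, z=3))
print(has_4_cycle([[0, 0, 0], [0, 1, 2]]))  # hypothetical 2 x 3 shift matrix
print(has_4_cycle([[0, 1], [3, 4]]))        # hypothetical 2 x 2 shift matrix
```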
• Step 1: Select the upper-left corner of matrix $W_a$ to construct a base matrix $H_b$ as in Eq. (4), of size $r \times s$, where $r \le 2^q$ and $s \le 2^q$, and set $r = 6$ and $s = 60$.
... in each column or row of the expanded matrix. Therefore, matrix $H$ also contains no 4-cycle after array dispersion.
... the storage component and passes through the DSU block, the VNU block, and the CNU block, and is finally written back to memory. As shown in Fig. 6, the DSU updates the minimum value of the submatrix according to the entered C2V message and the index of the given column. Before being transferred to the VNU, the selected minimum values are aligned in a barrel shifter whose shift value is equal to $A_{i,j}$. The CNU updates the C2V data according to the V2C messages from the VNUs and the prior C2V messages. The shift value of each barrel shifter in the CNU is $63 - A_{i,j}$, which aligns the different V2C messages. Each CNU sorter contains one four-to-two comparator to generate the first and second minimum values. Each sign update unit produces a new global sign. Compared with [6], the proposed architecture requires twice the selector resources and VNUs.
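The two shift values quoted above, $A_{i,j}$ in the DSU and $63 - A_{i,j}$ in the CNU, add up to the circulant size, so the second shift undoes the first. The following Python sketch shows this under the assumption that both barrel shifters rotate in the same direction; the function name `barrel_shift` and the shift value 17 are hypothetical.

```python
Z = 63  # circulant size 2**q - 1 with q = 6, as in the text


def barrel_shift(msgs, shift):
    """Cyclically rotate a message vector left by `shift` positions."""
    shift %= len(msgs)
    return msgs[shift:] + msgs[:shift]


a = 17                                   # hypothetical shift value A_{i,j}
msgs = list(range(Z))                    # one message per row of the submatrix
aligned = barrel_shift(msgs, a)          # DSU-side alignment by A_{i,j}
restored = barrel_shift(aligned, Z - a)  # CNU-side shift by 63 - A_{i,j}
assert restored == msgs                  # total rotation is 63 positions, i.e. none
```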
Barrel shifters and storage memory
Fig. 6. Architecture of a CNU and a DSU block in detail.
In traditional architectures [6], barrel shifters are present only in the CNUs and need to process 19 bits, consisting of two C2V messages (three bits per value), two address messages (six bits per value), and a global sign bit. The parallel architecture aligns the selected minimum value in the DSU first, and the V2C messages and sign bits in the CNU next. Each barrel shifter in the DSU processes a three-bit message, and each barrel shifter in the CNU processes a three-bit message and a sign bit. Table I lists a comparison of barrel shifters between the proposed architecture and that in [6]. Table II shows storage usage in the decoding architecture. Before decoding begins, the initial LLR messages and the parity-check matrix are stored in a single-port RAM and a ROM, respectively. Compared with [6], the hard-decision RAM and the sign-and-check RAM are twice the width and half the depth in the proposed architecture. A hybrid storage architecture containing a two-port RAM and six register blocks is also adopted.
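The barrel-shifter widths quoted in this section can be checked with a few lines of arithmetic. The Python sketch below uses only the bit counts stated in the text; the constant names are hypothetical. It says nothing about how many shifters each architecture uses, which is the subject of Table I.

```python
MSG_BITS = 3    # bits per C2V or V2C message value, from the text
ADDR_BITS = 6   # bits per address message, from the text
SIGN_BITS = 1   # one sign bit

# Traditional CNU shifter: two C2V values, two addresses, one global sign bit
traditional_width = 2 * MSG_BITS + 2 * ADDR_BITS + SIGN_BITS
# Proposed architecture: separate, narrower shifters in the DSU and the CNU
dsu_width = MSG_BITS
cnu_width = MSG_BITS + SIGN_BITS

print(traditional_width, dsu_width, cnu_width)  # 19 3 4
```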
Results of emulation
We implemented the proposed architecture on the Xilinx Virtex UltraScale XCVU440 FPGA emulation platform. The maximum number of iterations was set to 20 and early termination was adopted.
For comparison, we constructed an (18900, 17010) LDPC code as in [6] without masking. The results of simulation with different DOPs are shown in Fig. 7. The FER and BER of both hard-one-bit and soft-two-bit decisions of the initial LLR were considered. The results show that the parallel architecture did not sacrifice correction performance in terms of BER.
Conclusion
In this paper, a 2 KB QC-LDPC code was proposed based on the Latin square and a new array dispersion method. With its diagonal-like structure, the parallel architecture reduces the number of barrel shifters compared with the traditional design. The simulation results show that the TAR of the proposed codes is significantly improved.
