Abstract-In this paper, we propose the construction of Spatially Coupled Low-Density Parity-Check (SC-LDPC) codes using a Quasi-Cyclic (QC) algorithm. The QC-based approach is optimized for memory efficiency in storing the parity-check matrix in the decoder. A hardware model of the parity-check storage units has been designed for a Xilinx FPGA to compare the logic and memory requirements of the various approaches. It is shown that the proposed QC SC-LDPC code (with optimization) can be stored with reasonable logic resources and without the need for block memory in the FPGA, and that a significant improvement in processing speed is achieved.
I. INTRODUCTION
Spatially Coupled LDPC (SC-LDPC) codes have recently drawn special interest in channel coding due to their excellent thresholds [1]. This performance is achieved at large code lengths (over 100K), which results in increased decoding complexity [2]. Practical implementation of such large LDPC codes is a well-known problem, particularly the storage of the parity-check matrix in hardware memory [3]. The structure of the matrix also significantly affects the complexity of the decoder. Quasi-Cyclic LDPC matrices have proven advantages over unstructured (random) matrices in design complexity and in the encoding process [4]. They also enable collision-free parallel processing in the decoder. Therefore, it is of great interest to investigate both their applicability to SC-LDPC codes and the scope for further improving decoding performance and reducing implementation complexity.
In this paper, we present the construction of SC-LDPC codes using QC techniques. The decoding performance of these codes is compared with that of standard LDPC codes. The inherent advantages of QC-based codes and the diagonal structure of the SC-LDPC matrix are further exploited to reduce the complexity of the decoder by reusing the circulants in the matrix. A detailed investigation is carried out to evaluate the bit error rate (BER) performance and the hardware implementation complexity of these memory-optimized QC SC-LDPC codes. The FPGA resource requirements and speed of operation are compared with those of QC and PEG SC-LDPC codes by implementing a hardware model of these codes. The memory efficiency and speed improvements achievable by using the proposed optimized QC SC-LDPC codes are also analysed.
A sample Tanner graph structure of SC-LDPC codes is shown in Fig. 1. The SC-LDPC code starts with the protograph of a standard $(d_l, d_r)$-regular LDPC code (see Fig. 1(a)), where $d_l$ and $d_r$ denote the average bit degree and check degree respectively. An SC-LDPC protograph is constructed by coupling a chain of $2L + 1$ protographs of the regular LDPC code (see Fig. 1(b)). A particular code of length $N$ is formed from the $(d_l, d_r, L)$ protograph chain by creating $M$ copies of every node and edge in the coupled chain. If a particular bit node is coupled to a particular check node in the original chain, each of its copies is connected to one of the $M$ copies of that check node in the final code. The choice of which copy each node is connected to depends on the construction of the code. Like any standard LDPC code, SC-LDPC codes can be constructed randomly or with some form of structure. In the following section we present a QC algorithm for the construction of SC-LDPC codes. In a QC SC-LDPC code, all $M$ copies of a particular edge in the protograph are specified using a circulant. The length, $N$, and rate, $r$, of the SC-LDPC code in Fig. 1(c) are computed as follows:
II. MEMORY-EFFICIENT QUASI-CYCLIC SC-LDPC CODES
The quasi-cyclic (QC) [5] construction of LDPC codes has various advantages, including simple encoding [6], parallel decoding and memory efficiency in storing the matrix elements in the decoder [7]. Hence it is desirable to have SC-LDPC codes in QC form. The construction of QC SC-LDPC matrices is presented in [8], [9]. The work in [8] discusses the performance improvements achieved by spatial coupling, particularly for quantum LDPC codes, whereas [9] presents techniques to improve the upper bound on the minimum Hamming distance of members of the QC sub-ensembles. In this section we present an innovative technique to further optimize QC-based SC-LDPC codes so as to significantly reduce the hardware requirements of the decoder, especially the Block RAMs (BRAMs) in an FPGA.
While designing an LDPC decoder, it is essential to store the structure of the LDPC matrix in hardware to carry out the decoding process. The structure is normally stored in BRAMs because of the large amount of data required, particularly for large LDPC codes [10], [11]. Compared to codes constructed using random or progressive edge growth (PEG) algorithms, the QC-based technique requires significantly less memory to store the matrix structure, because only the circulants need to be stored instead of the entire matrix. However, a systematic shifting circuit is needed to shift the position of each circulant to realize the complete LDPC matrix in the decoder hardware [10]. Therefore, a decoder using QC-based LDPC matrices requires less memory but uses additional logic elements for the shift operations.
The QC SC-LDPC matrix consists of a chain of circulants making up the $2L + 1$ protographs. For a $(d_l, 2d_l)$-regular code (i.e., a rate-1/2 code), each protograph in the chain corresponds to a set of $2d_l$ individual circulants. We say that a circulant $X$ is $x$ cyclic shifts of the binary identity matrix, so the first column of the circulant $X$ has a '1' entry in row $x + 1$, the second column of $X$ has a '1' entry in row $x + 2$, and so on.
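As a minimal sketch of this definition, the following Python builds a circulant as a list of rows. It assumes 0-based indices (so the '1' in "row $x + 1$" of the first column becomes row $x$ of column 0) and a wrap-around modulo the circulant size $M$. The names `circulant` and `check_row` are illustrative and do not come from the paper.

```python
def circulant(x, M):
    """Return the M x M binary identity matrix cyclically shifted by x.

    With 0-based indices, column c has its single 1 in row (c + x) % M.
    """
    return [[1 if r == (c + x) % M else 0 for c in range(M)] for r in range(M)]


def check_row(x, c, M):
    """Row index of the 1 in column c, computed from the shift value alone."""
    return (c + x) % M


# Small example: shift x = 2, circulant size M = 5.
X = circulant(2, 5)
for row in X:
    print(row)
```

Only the shift value has to be stored for each circulant; a function such as `check_row` can recover the position of any entry on demand, which is the property the memory comparison in this section relies on.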
The staircase structure of the SC-LDPC codes offers a unique opportunity to optimize the QC based technique to further reduce the hardware resources required for the decoder. We investigate whether a fixed set of circulants in the matrix (i.e. fewer than 2L + 1 independent circulant sets) can be reused without affecting the decoding performance. To this end, we consider SC-LDPC codes constructed with one, two and three repeating sets of circulants. Fig. 2(a), Fig. 2(b) and Fig. 2(c) show one, two and three repeating sets respectively for a rate
A. The effect of circulant reuse on code girth
For a quasi-cyclic SC-LDPC code with the reuse of $u$ columns of circulants as in Fig. 2, we have the following constraint on its girth. Note that we consider column weights of 4 (or more) since, as will be shown later, these are the column weights we require for SC-LDPC codes.
Lemma 1: Given a quasi-cyclic SC-LDPC code with reuse $u = 1$, 2 or 3, with $d_l \geq 4$ and $L \geq 2$, the girth of this code is at most $4 + 2u$.
Proof Sketch: We will consider the case $u = 2$ and examine the first column of circulants for each protograph in the chain. We assume that the first four circulants in the first column set are given by $A$, $B$, $C$ and $D$ and the first four circulants in the second column set are given by $E$, $F$, $G$ and $H$. The columns are repeated as shown in Fig. 2(b) and so we have the following parity-check sub-matrix corresponding to the first circulant column of the first four protographs:
We will say that these four columns of circulants begin in the $j_0$-th, $j_1$-th, $j_2$-th and $j_3$-th columns of $H$ respectively. Similarly, we will say that these seven rows of circulants begin in the $i_0$-th to $i_6$-th rows of $H$ respectively. Consider the $(i_1 + b)$-th row of $H$. $H$ has nonzero entries in positions $(i_1 + b, j_0)$ and $(i_1 + b, j_1 + b - e)$. Consider the $(i_3 + d)$-th row of $H$. $H$ has nonzero entries in positions $(i_3 + d, j_0)$ and $(i_3 + d, j_3 + d - e)$. Consider the $(i_3 + b - e + g)$-th row of $H$. $H$ has nonzero entries in positions $(i_3 + b - e + g, j_1 + b - e)$ and $(i_3 + b - e + g, j_2 + b - e + g - b) = (i_3 + b - e + g, j_2 - e + g)$. Consider the $(i_4 + d - e + g)$-th row of $H$. $H$ has nonzero entries in positions $(i_4 + d - e + g, j_2 + d - e + g - d) = (i_4 + d - e + g, j_2 - e + g)$ and
There exists an 8-cycle connecting these eight non-zero entries as
Although the proof only requires that we find one 8-cycle which exists for any choice of $a, b, \ldots, h$, in fact every column of $H$ (and hence every codeword bit) is involved in 8-cycles in QC SC-LDPC codes with reuse-2 and column weight at least four, regardless of which circulants are chosen. The proofs for the cases $u = 1$ and $u = 3$ follow a similar argument.
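The bound of Lemma 1 can be checked numerically on a small example. The Python sketch below is not the authors' construction; it assumes a staircase layout in which circulant column $k$ occupies row blocks $k$ to $k + d_l - 1$ and takes its shifts from set $k \bmod u$, and it keeps only one circulant column per protograph (as in the proof sketch). A cycle in this sub-matrix is also a cycle in the full matrix, so its girth bounds the girth of the full code from above. The function names, the circulant size $M = 31$ and the random shifts are illustrative choices.

```python
import random
from collections import deque


def build_tanner(shift_sets, num_cols, M):
    """Adjacency lists of the Tanner graph for the assumed staircase layout.

    Bit node ids are k*M + j; check node ids follow after all bit nodes.
    """
    u, d_l = len(shift_sets), len(shift_sets[0])
    n_bits = num_cols * M
    n_checks = (num_cols + d_l - 1) * M
    adj = [[] for _ in range(n_bits + n_checks)]
    for k in range(num_cols):
        for t, x in enumerate(shift_sets[k % u]):
            for j in range(M):
                bit = k * M + j
                chk = n_bits + (k + t) * M + (j + x) % M
                adj[bit].append(chk)
                adj[chk].append(bit)
    return adj


def girth(adj):
    """Length of the shortest cycle, found by a BFS from every node."""
    best = float("inf")
    for s in range(len(adj)):
        dist, parent = {s: 0}, {s: -1}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w], parent[w] = dist[v] + 1, v
                    queue.append(w)
                elif parent[v] != w:
                    # Non-tree edge: closes a cycle through the BFS tree.
                    best = min(best, dist[v] + dist[w] + 1)
    return best


rng = random.Random(0)
M, d_l, L = 31, 4, 2          # illustrative sizes; L = 2 gives 2L + 1 = 5 columns
results = {}
for u in (1, 2, 3):
    shift_sets = [[rng.randrange(M) for _ in range(d_l)] for _ in range(u)]
    results[u] = girth(build_tanner(shift_sets, 2 * L + 1, M))
    print(f"reuse u={u}: girth={results[u]}, bound 4+2u={4 + 2 * u}")
```

Under the stated layout, the measured girth should not exceed $4 + 2u$ for any choice of shifts; the random shifts here are not screened for 4-cycles, so the girth may also come out smaller than the bound.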
III. PERFORMANCE OF SC-LDPC CODES
The QC SC-LDPC codes were simulated to evaluate the BER performance on a binary-input additive white Gaussian noise (BI-AWGN) channel. We used multi-edge density evolution to compute the thresholds of SC-LDPC codes, shown in Fig. 3, and noted that $L \geq 16$ is necessary to achieve thresholds better than that of $(3, d_r)$ standard LDPC codes and that $d_l = 4$ has an improved threshold over $d_l = 3$ as $L$ is increased. Also, as SC-LDPC codes are known to have good performance for very long codes [1], we compared decoding performance across code lengths, as shown in Fig. 4. Given these results, a fairly long SC-LDPC code of 100,360 bits (≈ 100K) with $d_l = 4$, $L = 96$ and $M = 260$ is considered for our simulation results. However, we note that even longer codes would perform noticeably better. Simulations were carried out using software models on a BI-AWGN channel. The sum-product algorithm was used for decoding with a maximum of 1000 iterations, and each simulation was run until at least 50 word errors were accumulated. Fig. 5 shows a comparison of quasi-cyclic SC-LDPC codes with different levels of circulant reuse. In the construction process for all codes, the circulants are chosen to avoid 4-cycles (girth 4) in the resulting matrix. As would be expected from the girth limitations, the quasi-cyclic matrices with reuse of one and two circulant columns have poor performance. However, by reusing three circulant columns, the BER performance we obtained is as good as that of the standard QC SC-LDPC codes.
Lastly, in Fig. 6 we compare the BER performance of the quasi-cyclic SC-LDPC codes with non-quasi-cyclic SC-LDPC codes and with standard LDPC codes. The non-quasi-cyclic SC-LDPC codes are constructed randomly and using a PEG algorithm modified for spatial coupling. The standard LDPC codes are constructed using the PEG [12] and QC [5] algorithms. From Fig. 6, it is clear that the waterfall regions of the SC-LDPC codes are better than those of the standard LDPC codes (as predicted by density evolution). Also, the performances of the random, PEG and QC-based SC-LDPC codes follow each other very closely.
IV. ESTIMATION OF HARDWARE REQUIREMENTS FOR USING SC-LDPC CODES
The new QC SC-LDPC codes are compared with random SC-LDPC codes by designing a decoder hardware model. The model consists of the units that are essential for storing the contents of the LDPC matrix and generating the appropriate address locations in a decoder. The block diagram of the hardware model is shown in Fig. 7. The hardware model takes a clock and a synchronous reset as inputs. It also contains an address counter and an output controller for sequencing the input and output data respectively. The bit and check node address generators are responsible for storing the LDPC matrix information and generating the appropriate addresses for the decoder. Note that the QC SC-LDPC codes require an additional Cyclic Shift unit to generate the appropriate addresses from the circulants for decoding, as shown in Fig. 7, whereas PEG or random SC-LDPC codes do not require any such unit, since the complete set of matrix elements is stored in hardware memory.
The hardware models for both Random/PEG and QC SC-LDPC codes have been designed and synthesized using Verilog HDL. The designs are placed and routed for a Xilinx Kintex-7 FPGA (XC7K355T) with LDPC codes of length 100K and 64K. The estimated FPGA hardware requirements and the maximum clock frequency achievable for the designs are shown in Table I. As expected, a large number of BRAMs are utilized by the PEG-based codes compared to the QC SC-LDPC codes to store the matrix elements. With slightly more logic units (registers and look-up tables (LUTs)), the standard QC codes offer a significant saving in BRAMs (up to 43 times in the case of 100K code lengths) compared to PEG SC-LDPC codes. As stated earlier, the increased logic requirements are due to the Cyclic Shift unit in the QC-based SC-LDPC hardware models. Further, it is also observed that QC SC-LDPC codes with reuse-3 do not require BRAMs, as the small number of circulants can easily be stored in LUTs.
Elimination of BRAMs also results in a significant improvement in the speed of operation (up to a 40% increase in the maximum operating clock frequency) compared to PEG or standard QC SC-LDPC codes (for a code length of 100K).
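For a rough sense of scale, the Python sketch below counts how many values each storage scheme would have to hold for the code simulated in Section III ($d_l = 4$, $L = 96$, $M = 260$). It rests on several assumptions that are not stated in the paper: a rate-1/2 protograph with two bit-node columns per section, so that $N = 2(2L + 1)M$ (which does reproduce the quoted length of 100,360 bits); one stored address per edge for random or PEG codes; one stored shift per circulant for QC codes; and three sets of $2d_l$ shifts for reuse-3. These are counts of stored entries only, not the BRAM, LUT or register figures of Table I.

```python
# Parameters of the simulated code (Section III).
d_l, L, M = 4, 96, 260

sections = 2 * L + 1                 # protographs in the coupled chain
N = 2 * sections * M                 # assumed: two bit-node columns per section

# Assumed storage models, one entry per stored value.
edge_entries = N * d_l               # random/PEG: one address per edge
qc_shifts = sections * 2 * d_l       # QC: one shift per circulant
reuse3_shifts = 3 * 2 * d_l          # reuse-3: three repeating circulant sets

print(f"code length N          : {N}")
print(f"random/PEG edge entries: {edge_entries}")
print(f"QC shift values        : {qc_shifts}")
print(f"QC reuse-3 shift values: {reuse3_shifts}")
```

Under these assumptions the reuse-3 code needs only a few dozen shift values, which is consistent with the observation that they fit in LUTs rather than BRAMs.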
V. CONCLUSION
This paper has presented the construction of SC-LDPC codes using the QC technique. It demonstrates the threshold advantages achievable by SC-LDPC codes over standard LDPC codes. Memory-optimized QC SC-LDPC codes are introduced to significantly reduce the decoder complexity associated with storing the matrix elements. It is shown that by reusing the circulant columns in the SC-LDPC matrix (with reuse-3), it is possible to obtain a memory-efficient decoder without noticeably affecting the decoding performance. The advantages of using the optimized QC SC-LDPC codes have been demonstrated by designing a hardware model, which shows a substantial reduction in memory requirements and a significant improvement in the operating clock frequency of the decoder.
