We propose strategies, applicable without loss of generality, to achieve a high-throughput FPGA-based architecture for a binary Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) code based on a circulant-1 identity matrix construction. We present a novel representation of the parity-check matrix (PCM) that provides a multi-fold throughput gain. Splitting the node processing algorithm enables pipelining of blocks and hence of layers. By partitioning the PCM into not only layers but also superlayers, we derive an upper bound on the two-layer pipelining depth for the compact representation. To validate the architecture, a decoder for the IEEE 802.11n (2012) QC-LDPC code is implemented on the Xilinx Kintex-7 FPGA with the help of the FPGA IP compiler available in the NI LabVIEW™ Communication System Design Suite (CSDS™). The compiler offers an automated and systematic compilation flow, which generated an optimized hardware implementation from the LDPC algorithm and achieved an overall throughput of 608Mb/s (at 260MHz). To the best of our knowledge, this is the fastest implementation of the IEEE 802.11n QC-LDPC decoder using an algorithmic compiler.
I. INTRODUCTION
For the next generation of wireless technology, collectively termed Beyond-4G and 5G (hereafter referred to as 5G), peak data rates of up to 10Gb/s with an overall latency of less than 1ms [1] are envisioned. However, due to the proposed operation in the 30-300GHz range, with challenges such as short communication range, increased shadowing and rapid fading in time, the processing complexity of the system is expected to be high. In an effort to design and develop a channel coding solution suitable for such systems, we present in this paper a high-throughput, scalable and reconfigurable FPGA decoder architecture.
It is well known that the structure offered by QC-LDPC codes [2] makes them amenable to time- and space-efficient decoder implementations relative to random LDPC codes. We believe that, given the primary requirement of high decoding throughput, QC-LDPC codes or their variants (such as accumulator-based codes [3]) that can be decoded using belief propagation (BP) methods are highly likely candidates for 5G systems. Thus, for the sole purpose of validating the proposed architecture, we chose a standard-compliant code and achieve a throughput that well surpasses the requirement of the chosen standard.
Insightful work on high-throughput (order of Gb/s) BP-based QC-LDPC decoders is available; however, most such works focus on ASIC designs [4], [5], which usually require intricate customizations at the RTL level and expert knowledge of VLSI design. A sizeable subset of these works caters to fully-parallel [6] or code-specific [7] architectures. From the point of view of an evolving research solution, this is not an attractive option for rapid prototyping. In the relatively less explored area of FPGA-based implementation, impressive results have recently been presented in works such as [8], [9] and [10]. However, these are based on fully-parallel architectures, which lack flexibility (they are code-specific) and are limited to small block sizes (primarily due to the inhibiting routing congestion), as discussed in the informative overview in [11]. Since our case study is based on fully-automated generation of the HDL, we compare our results with another state-of-the-art implementation [12] in Section IV. Moreover, in this paper we provide strategies, applicable without loss of generality, to achieve a high-throughput FPGA-based architecture for a QC-LDPC code based on a circulant-1 identity matrix construction.
The main contribution of this brief is a compact representation (in matrix form) of the PCM of the QC-LDPC code, which provides a multi-fold increase in throughput. In spite of the resulting reduction in the degrees of freedom for pipelined processing, we achieve efficient pipelining of two layers and also provide an upper bound, valid without loss of generality, on the pipelining depth that can be achieved in this manner. The splitting of the node processing allows us to achieve the said degree of pipelining without utilizing additional hardware resources. The algorithmic strategies were realized in hardware for our case study by the FPGA IP [13] compiler in LabVIEW™ CSDS™, which translated the entire software-pipelined high-level language description into VHDL, enabling state-of-the-art rapid prototyping. We have also demonstrated the scalability of the proposed architecture in an application that achieves over 2Gb/s of throughput [14].
The remainder of this paper is organized as follows. Section II describes the QC-LDPC codes and the decoding algorithm chosen for this implementation. The strategies for achieving high throughput are explained in Section III. The case study is discussed in Section IV, and we conclude with Section V.
978-1-4799-8091-8/15/$31.00 ©2015 IEEE
II. QUASI-CYCLIC LDPC CODES AND DECODING
LDPC codes are a class of linear block codes that have been shown to achieve near-capacity performance on a broad range of channels. Invented by R. Gallager [15] in 1960, they are characterized by a low-density (sparse) PCM representation. Mathematically, given $k$ message bits, an LDPC code is the null space of its $m \times n$ PCM $H$, where $m$ denotes the number of parity-check equations (parity bits) and $n$ $(= k + m)$ denotes the number of variable nodes (code bits) [2]. In the Tanner graph [2] representation, the $i$-th check node (CN) is connected to the $j$-th variable node (VN) if $H(i, j) = 1$, $1 \le i \le m$ and $1 \le j \le n$.
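As an illustration of the null-space and Tanner graph definitions above, the following minimal Python sketch computes the syndrome of a word and lists the VN neighbors of a CN. The matrix `H`, the word `codeword` and the function names are hypothetical toy values chosen for this example (indices are 0-based); they are not taken from any standard.

```python
# Toy parity-check matrix (hypothetical, not from any standard): m = 3 checks, n = 6 bits.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def check_neighbors(H, i):
    """Return N(i): the variable nodes connected to check node i (0-based)."""
    return [j for j, h in enumerate(H[i]) if h == 1]

def syndrome(H, v):
    """Compute v * H^T over GF(2); an all-zero result means v is a codeword."""
    return [sum(h * b for h, b in zip(row, v)) % 2 for row in H]

codeword = [1, 1, 0, 0, 1, 1]      # satisfies all three toy parity checks
print(syndrome(H, codeword))       # all-zero syndrome
print(check_neighbors(H, 0))       # VNs participating in the first check
```

A word with a single flipped bit yields a nonzero syndrome in exactly the checks that contain that bit, which is the condition the stopping rule of the decoder tests.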
A. Quasi-Cyclic LDPC Codes
QC-LDPC codes belong to the class of structured codes that do not significantly compromise performance relative to randomly constructed LDPC codes. The construction of QC-LDPC codes relies on an $m_b \times n_b$ matrix $H_b$, sometimes called the base matrix, which comprises cyclically right-shifted identity and zero submatrices both
B. Scaled Min-Sum Algorithm for Decoding QC-LDPC Codes
LDPC codes can be decoded using message passing (MP) or BP [15], [16] on the bipartite Tanner graph. In this work we have employed the efficient decoding algorithm presented in [17], with a pipelining schedule based on the row-layered decoding technique [18], detailed in Section III-C.
Definition 1. For $1 \le i \le m$ and $1 \le j \le n$, let $v_j$ denote the $j$-th bit in the length-$n$ codeword and let $y_j = v_j + n_j$ denote the corresponding received value from the channel, corrupted by the noise sample $n_j$. Let the variable-to-check (VTC) message from VN $j$ to CN $i$ be $q_{ij}$, and let the check-to-variable (CTV) message from CN $i$ to VN $j$ be $r_{ij}$. Let the a posteriori probability ratio for variable node $j$ be denoted as $p_j$.
The steps of the scaled-MSA are given below.
1) Initialization: The a posteriori probability, expressed as the Log-Likelihood Ratio (LLR) $p_j$ for VN $j$, and the CTV messages are initialized,
where $1 \le i \le m$, and $k \in N(i) \setminus \{j\}$ represents the set of the VN neighbors of CN $i$ excluding VN $j$.
Stopping Criteria: If $\hat{v}H^T = 0$ or $t = t_{max}$ (the maximum number of decoding iterations), declare $\hat{v}$ as the decoded codeword.
We scale the CTV messages by a factor $a$ (set to 0.75) to compensate for the performance loss due to the MSA approximation [19]. The above algorithm provides several advantages [17], [20]: a single processing unit is required for both CN and VN message updates; memory storage is reduced on account of the on-the-fly computation of the VTC messages $q_{ij}$; and the algorithm converges faster than the standard MP flooding schedule, requiring fewer decoding iterations.
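To make the message flow concrete, the sketch below performs one scaled min-sum update for a single CN in a row-layered schedule. It assumes the commonly used layered form, in which the VTC message is obtained on the fly as $q_{ij} = p_j - r_{ij}$, the new CTV message is the scaled product of signs times the minimum magnitude over $N(i) \setminus \{j\}$, and the a posteriori value is refreshed as $p_j = q_{ij} + r_{ij}$. This form is an assumption of the example rather than a transcription of equations (2)-(4), and the function and argument names are hypothetical.

```python
def update_check_node(p, r_row, neighbors, a=0.75):
    """One layered scaled min-sum update for a single check node (sketch).

    p         -- list of a posteriori LLRs indexed by VN (updated in place)
    r_row     -- dict {j: r_ij} holding the previous CTV messages of this CN
    neighbors -- N(i), the VNs connected to this CN (degree >= 2 assumed)
    a         -- scaling factor (0.75 in the text)
    """
    # VTC messages computed on the fly (assumed form: q_ij = p_j - r_ij).
    q = {j: p[j] - r_row[j] for j in neighbors}
    for j in neighbors:
        others = [q[k] for k in neighbors if k != j]    # k in N(i) \ {j}
        sign = 1
        for x in others:
            if x < 0:
                sign = -sign
        r_new = a * sign * min(abs(x) for x in others)  # scaled min-sum CTV message
        r_row[j] = r_new
        p[j] = q[j] + r_new                             # refreshed a posteriori LLR
    return p, r_row

p, r = update_check_node([2.0, -1.0, 4.0], {0: 0.0, 1: 0.0, 2: 0.0}, [0, 1, 2])
print(p, r)
```

This direct form recomputes the minimum for every edge, so its cost grows quadratically with the CN degree; it is meant only to show the data dependencies between $p_j$, $q_{ij}$ and $r_{ij}$.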
III. TECHNIQUES FOR HIGH-THROUGHPUT
In this section, we present strategies, applicable without loss of generality, to achieve a high-throughput FPGA architecture for an MSA-based QC-LDPC decoder.
A. Linear Complexity Node Processing
The hardware elements that process equations (2)-(4) are collectively referred to as the Node Processing Unit (NPU). If the degree of CN $i$ is $d_{c_i}$, the complexity of computing the minimum value in equations (2)-(4) (in terms of the comparisons required) is $O(d_{c_i}^2)$. To achieve linear complexity $O(d_{c_i})$, in our implementation the minimum value is computed in two phases or passes: a first (global) pass, in which the first and the second minimum (the smallest value in the set excluding the minimum value of the set) are computed, and a second (local) pass, in which the overall minimum is computed. A similar approach is found in [21], [4]. Based on the functionality of the two passes, the NPU is divided into the Global NPU (GNPU) and the Local NPU (LNPU). A detailed algorithm is given in [20].
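The two-pass idea can be sketched in a few lines of Python. The global pass scans the message magnitudes once and records the first minimum, the second minimum and the position of the first minimum; the local pass then selects, for each edge, the minimum over all other edges. The function names and the use of plain lists are assumptions of this sketch and do not describe the hardware GNPU and LNPU.

```python
def global_pass(mags):
    """Global pass: one scan yielding (min1, min2, index of min1)."""
    min1 = min2 = float("inf")
    idx1 = -1
    for k, x in enumerate(mags):
        if x < min1:
            min1, min2, idx1 = x, min1, k   # old min1 becomes the second minimum
        elif x < min2:
            min2 = x
    return min1, min2, idx1

def local_pass(min1, min2, idx1, degree):
    """Local pass: per-edge minimum over all other edges."""
    return [min2 if k == idx1 else min1 for k in range(degree)]

mags = [3.0, 1.0, 2.0, 5.0]
min1, min2, idx1 = global_pass(mags)
print(local_pass(min1, min2, idx1, len(mags)))
```

Each pass touches every edge once, so the total number of comparisons grows linearly with the CN degree, and the result matches what a per-edge exhaustive search would return.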
B. z-fold Parallelization of NPUs
The CN message computation given by equation (3) is repeated $m$ times in a decoding iteration, i.e., once for each CN. A straightforward serial implementation of this kind is slow and undesirable. As no CN in the set of $z$ CNs given by $I_s$ shares a VN with another CN in the same set, we operate $z$ NPUs in parallel (hereafter referred to as an NPU array), resulting in a $z$-fold increase in throughput.
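The independence of the $z$ CNs within one shifted identity submatrix can be checked with a short Python sketch. It assumes the convention that row $r$ of a $z \times z$ identity cyclically right-shifted by $s$ has its single 1 in column $(r + s) \bmod z$; the function name is hypothetical and the shift value used below is arbitrary.

```python
def circulant_columns(z, s):
    """Column index of the single 1 in each row of a z x z identity
    matrix cyclically right-shifted by s (assumed convention)."""
    return [(row + s) % z for row in range(z)]

# z = 81 is a submatrix size of the case study; s = 5 is an arbitrary shift.
cols = circulant_columns(z=81, s=5)
print(len(set(cols)) == len(cols))   # True when no two rows share a column
```

Since every row maps to a distinct column for any shift, the $z$ check nodes of a block never contend for the same VN, which is what permits the NPU array to process them simultaneously.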
C. Layered Decoding
The flooding schedule [15] quickly becomes intractable in hardware due to the complex interconnect pattern between the nodes of the bipartite graph, and its implementations are usually restricted to a specific code structure. In the efficient scaled-MSA algorithm (Section II-B), one can process multiple nodes at the same time if the following condition is satisfied.
Fact 2. From the perspective of CN processing, two or more CNs can be processed at the same time (i.e., they are independent of each other) if they do not have one or more VNs (code bits) in common.
The subset of rows satisfying Fact 2 is termed a row-layer (hereafter referred to as a layer). In other words, given a set [20]. From the VN or column perspective, $|L_u| = z$, $\forall u \in \{1, 2, \ldots, I\}$ implies that the columns of $H$ are also divided into subsets of size $z$ (called block columns from now on) given by the set $B = \{B_1, B_2, \ldots, B_J\}$, $J = n/z = n_b$. The VNs belonging to a block column may participate in CN equations across several layers. We call the intersection of a layer and a block column a block. Two or more layers $L_u$, $L_{u'}$ are said to be dependent with respect to the block
D. Compact Representation of H b
Before we discuss the pipelined processing of layers, we present a novel compact (and thus efficient) matrix representation that leads to a significant improvement in throughput. We call the zero submatrices in $H$ (each corresponding to a $-1$ in $H_b$) invalid blocks, since there are no edges between the corresponding CNs and VNs. The other submatrices $I_s$ are called valid blocks. In a conventional approach to scheduling, e.g., in [5], message computation is done over all the valid and invalid blocks. To avoid processing invalid blocks, we propose an alternate representation of $H_b$ in the form of two $(m_\beta \times n_\beta)$ matrices: $\beta_I$, the block index matrix (Table I), and $\beta_S$, the block shift matrix (Table II), which hold the locations and the shift values (and hence the connections between the CNs and VNs), respectively, of only the valid blocks in $H_b$. The algorithm to construct $\beta_I$ is given in [20]. To quantify the benefit of this alternate representation, let us define the following ratio.
Definition 2. Let $\lambda$ denote the compaction ratio, which is the ratio of the number of columns of $\beta_I$ (which is the same for $\beta_S$) to the number of columns of $H_b$. Hence, $\lambda = J / n_b$, where $J$ here denotes the number of columns of $\beta_I$. The compaction ratio $\lambda$ is a measure of the compaction achieved by the alternate representation of $H_b$. Compared to the conventional approach to scheduling node processing mentioned above, scheduling as per the $\beta_I$ and $\beta_S$ matrices improves throughput by a factor of $1/\lambda$. For example, in the case study in Section IV, $\lambda = 8/24 = 1/3$ provides a throughput gain of $1/\lambda = 3$.
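A minimal Python sketch of the compact representation is given below. It is not the construction algorithm of [20]; it simply collects, for each row of a base matrix, the column indices and shift values of the valid blocks and then computes the compaction ratio. The toy base matrix, the function name `compact` and the padding of shorter rows with $-1$ are all assumptions made for illustration.

```python
def compact(Hb, pad=-1):
    """Build block-index and block-shift matrices from a base matrix (sketch).

    Hb[i][j] is the cyclic shift of block (i, j), or -1 for an invalid (zero) block.
    Rows with fewer valid blocks than the widest row are padded with `pad`
    (a padding convention assumed for this sketch).
    """
    width = max(sum(1 for s in row if s != -1) for row in Hb)
    beta_I, beta_S = [], []
    for row in Hb:
        valid = [(j, s) for j, s in enumerate(row) if s != -1]
        fill = [pad] * (width - len(valid))
        beta_I.append([j for j, _ in valid] + fill)   # locations of valid blocks
        beta_S.append([s for _, s in valid] + fill)   # shifts of valid blocks
    return beta_I, beta_S

# Toy base matrix (hypothetical values, not the IEEE 802.11n matrix).
Hb = [
    [ 0, -1,  3, -1,  1, -1],
    [-1,  2, -1,  0, -1, -1],
    [ 4, -1, -1, -1,  2,  0],
]
beta_I, beta_S = compact(Hb)
lam = len(beta_I[0]) / len(Hb[0])   # compaction ratio for the toy matrix
print(beta_I, beta_S, lam)
```

For this toy matrix the scheduler visits 3 block positions per row instead of 6, which mirrors how the ratio $8/24$ in the case study translates into a threefold reduction in the number of blocks scheduled per layer.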
E. Layer-Pipelined Decoder Architecture
We now discuss the pipelined processing of blocks (and hence layers), made possible by the linear complexity processing shown in Section III-A. Table II shows one such rearrangement of $\beta_I$ for the QC-LDPC code of our case study. However, some dependencies still remain (shown in blue in Table II).
Remark 1. Although such a rearrangement has been done before (e.g., [21], [4]), here we perform the rearrangement on $\beta_I$, which allows far fewer degrees of freedom since $J = \lambda n_b$. Thus, there is an upper bound on the extent to which two layers can be pipelined with minimum overhead; we derive this bound below.
Fig. 1(a) shows the block-level view of the NPU timing diagram without the pipelining of layers. As seen in Section III-A, the GNPU and LNPU operate in tandem and in that order, implying that the LNPU has to wait for the GNPU updates to finish. The layer-level picture is depicted in Fig. 1(c). This idling of the GNPU and LNPU can be avoided by introducing pipelined processing of blocks, given by the following Lemma.
Proof. Follows directly from the layer independence condition in Fact 2.
[Figure caption: ... showing the z-fold parallelization of the NPUs with an emphasis on the splitting of the sign and the minimum computation given in equation (3). Note that other computations in equations (2)-(4) are not shown here for simplicity. For both the pipelined and the non-pipelined versions, the processing schedule for the block processing loop is as per Fig. 1(b) and that for the layer processing loop is as per Fig. 1(d).]
Fig. 1(b) illustrates the block-level view of this 2-layer pipelining scheme. It is important to note that the splitting of the NPU process into two parts (Section III-A), namely, the GNPU and the LNPU (which work in tandem), is a necessary condition for Lemma 1 to hold.
Definition 3. Without loss of generality, the pipelining efficiency $\eta_p$ is the number of layers processed per unit time per NPU array.
For the case of pipelining two layers shown in Fig. 1(d), $\eta_p^{(2)} = \frac{|L|}{|L|+1}$. Thus, we impose the following conditions on $|L|$:
1) Since two layers are processed in the pipeline at any given time, $|L| \in F = \{x : x \text{ is an even factor of } I\}$.
2) Given a QC-LDPC code, $|L|$ is a constant. This is to facilitate a symmetric pipelining architecture, which is a scalable solution.
3) The choice of $|L|$ should maximize the pipelining efficiency $\eta_p$: $l^* = \arg\max_{|L| \in F} \eta_p$.
For example, in the case study, $I = m_b = 12$, $F = \{2, 4, 6\}$ and $l^* = \arg\max_{|L| \in F} \eta_p = 6$.
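The selection of $l^*$ can be reproduced with a short Python sketch. It assumes that the candidate set contains the even factors of $I$ other than $I$ itself, which is the reading that matches the set $\{2, 4, 6\}$ quoted for $I = 12$; the function names are hypothetical.

```python
from fractions import Fraction

def even_factors(I):
    """Even factors of I, excluding I itself (assumed, to match F = {2, 4, 6} for I = 12)."""
    return [x for x in range(2, I, 2) if I % x == 0]

def efficiency(L):
    """Two-layer pipelining efficiency |L| / (|L| + 1), kept as an exact fraction."""
    return Fraction(L, L + 1)

F = even_factors(12)
l_star = max(F, key=efficiency)
print(F, l_star, efficiency(l_star))
```

Because $|L|/(|L|+1)$ increases monotonically with $|L|$, the maximization always picks the largest admissible element of $F$, which for the case study gives an efficiency of $6/7$.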
High-level Architecture: The high-level decoder architecture for both versions is shown in Fig. 2. Code parameters such as $\beta_I$, $\beta_S$ and $n$ are stored in the ROM. A posteriori probability (APP) values and the CN messages are allocated separate RAMs. The barrel shifter circularly rotates $(z \times f)$ APP values to the right by using the shift values from $\beta_S$, effectively implementing the connections between the CNs and VNs. The GNPU and LNPU arrays compute messages, which are stored back at their respective locations in the respective RAMs for processing the next block.
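The role of the barrel shifter can be illustrated in software as a circular rotation of the $z$ APP values of one block column. The sketch below is a behavioral model only; the function name is hypothetical, and how the rotation direction maps onto the shift convention of the hardware is an assumption.

```python
def rotate_right(values, s):
    """Circularly rotate a list to the right by s positions (behavioral sketch)."""
    z = len(values)
    s %= z                      # shifts wrap around the block size
    if s == 0:
        return list(values)
    return values[-s:] + values[:-s]

app_block = [10, 11, 12, 13, 14]      # toy APP values for a block of size z = 5
print(rotate_right(app_block, 2))
```

After the rotation, the value at position $k$ of the output is the one that the $k$-th NPU of the array consumes, so a single shift value per block is enough to realize all $z$ CN-to-VN connections of that block.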
IV. CASE STUDY
We believe that the family of structured LDPC codes contains highly likely candidates for 5G systems. Thus, to demonstrate the initial phase of our FPGA decoder architecture [20], we provide a case study based on the QC-LDPC code specified in the IEEE 802.11n (2012) standard [22]. For this code, $m_b \times n_b = 12 \times 24$ and $z = 27$, 54 and 81, resulting in code lengths of $n = 24 \times z = 648$, 1296 and 1944 bits, respectively. Our implementation supports the submatrix size of $z = 81$ and hence is capable of supporting all the block lengths for the rate $R = 1/2$ code.
At the time of writing this paper, we have implemented two versions: the 1x (non-pipelined, Fig. 1(a) and (c)) and the 2x (pipelined, Fig. 1(b) and (d)). For both versions, input LLRs from the channel and the CTV and VTC messages are represented with 6 signed bits and 4 fractional bits. Fig. 3 shows the bit-error rate (BER) performance for the floating-point and the fixed-point data representations with 8 decoding iterations; the fixed-point performance is 0.5dB worse, as expected.
The decoder algorithm for both versions was described using the FPGA IP compiler [13] in LabVIEW™ CSDS™. We would like to emphasize here that both versions were described in software at the algorithmic description level and not at the HDL level. The algorithmic compiler translated the high-level description to an HDL description for the case study decoder implementation. The VHDL code was synthesized, placed and routed using the Xilinx Vivado compiler on the Xilinx Kintex-7 FPGA available on the NI PXIe-7975R FPGA board.
The 2x version achieves an overall throughput of 608Mb/s with a clock rate of 260MHz and a latency of 5.7μs. As seen in Table III, resource usage for the 2x version is close to that of the 1x version in spite of the 1.8x gain in throughput. The FPGA IP compiler chooses to use more Flip-Flops (FFs) for data storage in the 1x version, while it uses more BRAM in the 2x version.
A contemporary implementation of the IEEE 802.11n LDPC decoder using a high-level algorithmic description compiled to an HDL on an FPGA [12] utilizes 2% of slice registers, 3% of slice LUTs and 20.9% of Block RAMs on the Spartan-6 LX150T FPGA with a comparable BER performance.
Gb/s LDPC Decoder [23], [14]: An application of this work has been demonstrated at IEEE GLOBECOM'14, where the QC-LDPC code of our case study was decoded with a throughput of 2.06 Gb/s, validating the scalability of the architecture.

V. CONCLUSION
In this brief we have proposed techniques to achieve high-throughput performance for an MSA-based decoder for QC-LDPC codes. The proposed compact representation of the PCM provides a significant improvement in throughput. An IEEE 802.11n (2012) decoder is implemented, which attains a throughput of 608Mb/s (at 260MHz) and a latency of 5.7µs on the Xilinx Kintex-7 FPGA. The FPGA IP compiler greatly reduces prototyping time and is capable of implementing complex signal processing algorithms. There is undoubtedly more scope for improvement; nevertheless, our current results are promising.
