Quasi-cyclic low-density parity-check (QC-LDPC) codes are the choice for data channels in the fifth generation (5G) new radio (NR). At the transmitter side, code bits from the QC-LDPC encoder are delivered to the rate matcher. The task of the rate matcher is to select an appropriate number of code bits via puncturing and/or repetition. Code bits that are not selected do not need to be encoded. At the receiver side, the de-rate matcher combines code bits of different transmission attempts and sends them to the QC-LDPC decoder. The output of the QC-LDPC decoder only needs to include necessary systematic bits. Unnecessary systematic bits and parity bits can be completely removed from the decoding process. Taking these considerations into account, a smaller sub-base matrix instead of a full-base matrix can be used in the encoding and decoding process. In this paper, we propose an efficient implementation of QC-LDPC codes for 5G NR. The full-base matrix is pruned before being used. Compared to the traditional schemes, the proposed scheme improves the throughput of QC-LDPC codes in 5G NR.
I. INTRODUCTION
Fifth generation (5G) new radio (NR) is the next generation of mobile networks beyond the fourth generation (4G) long term evolution (LTE) [1], [2]. 5G NR supports three scenarios: enhanced mobile broadband (eMBB), ultra-reliable and low-latency communications (uRLLC) and massive machine type communications (mMTC). These three scenarios have requirements that include low latency and high throughput [3]. The peak throughput requirement is 10 Gbps for uplink and 20 Gbps for downlink. The user plane latency requirement is 4 ms for eMBB and 1 ms for uRLLC. The control plane latency requirement is 20 ms. Taking these requirements into consideration, quasi-cyclic low-density parity-check (QC-LDPC) codes are adopted by the 5G NR standard for data channels [4], [5]. To simplify the implementation, QC-LDPC codes have a core part with a dual-diagonal structure and an extension part with a diagonal structure. Two base matrices, BG1 and BG2, are defined to guarantee the decoding performance for the full range of transport block sizes and code rates. For BG1, the mother code rate is 1/3. For BG2, the mother code rate is 1/5.
The associate editor coordinating the review of this manuscript and approving it for publication was Martin Reisslein.
In 5G NR, the physical downlink shared channel (PDSCH) and the physical uplink shared channel (PUSCH) are used for unicast data transmissions. The data from the medium access control layer to the physical layer is organized in the form of the transport block. Fig. 1 illustrates the processing of the transport block in 5G NR [6]. At the transmitter side, the following steps are carried out for a transport block: cyclic redundancy check (CRC) attachment, code block segmentation, QC-LDPC encoding, rate matching, bit interleaving and code block concatenation. The task of the rate matcher is to select an appropriate number of code bits from the output of the QC-LDPC encoder via puncturing and/or repetition to match the available radio resources. The redundancy version determines the exact set of code bits to be selected. As a consequence, code bits that are not selected do not need to be encoded. At the receiver side, the following steps are carried out for a transport block: code block segmentation, bit de-interleaving, de-rate matching, QC-LDPC decoding, code block concatenation, transport block CRC check. The de-rate matcher combines code bits of different transmission attempts and sends them to the QC-LDPC decoder. The output of the QC-LDPC decoder only needs to include necessary systematic bits. As a consequence, unnecessary systematic bits and parity bits can be completely removed from the decoding process. Taking these considerations into account, a smaller sub-base matrix instead of a full-base matrix can be used in the encoding and decoding process.
The architectures of the QC-LDPC encoder include: dual-diagonal [7]-[10], Richardson-Urbanke [11]-[13], LU decomposition [14], [15], etc. In 5G NR, the dual-diagonal architecture is usually used to reduce the complexity of the encoder. The throughput of the dual-diagonal architecture is approximately inversely proportional to the number of rows of the base matrix. The architectures of the QC-LDPC decoder include: block parallel [16]-[18], row parallel [19], [20], full parallel [21], [22], etc. [23]. In 5G NR, the block parallel architecture is usually used to reduce the complexity of the decoder. The throughput of the block parallel architecture is approximately inversely proportional to the number of circulant blocks in the base matrix. As a consequence, the throughput of QC-LDPC codes can be improved by using a smaller sub-base matrix instead of a full-base matrix.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Reference [24] only describes the encoding procedure for QC-LDPC codes in 5G NR. The scheme of constructing the sub-base matrix is not described. Reference [7] uses the full-base matrix in the encoding process. It is a waste of resources for the transmitter to encode unnecessary code bits. The sub-base matrix constructed in references [25] - [30] is a leading sub-base matrix [31] . That is, the sub-base matrix is obtained by selecting the intersection of the first i rows and the first j columns of the full-base matrix. The sub-base matrix constructed in [25] - [30] can be further pruned to increase the throughput. This paper proposes an efficient implementation of QC-LDPC codes for 5G NR. Unlike the traditional schemes, the sub-base matrix constructed by the proposed scheme is not restricted to the leading sub-base matrix. In addition, the proposed scheme takes into account the difference in the construction of the sub-base matrix between the encoder and the decoder. Compared to the traditional schemes, the proposed scheme improves the throughput of QC-LDPC codes in 5G NR. The rest of the paper is organized as follows: Section II describes the processing of the transport block in 5G NR. Section III gives the scheme of constructing the sub-base matrix for the encoder and the decoder. Section IV presents the architecture of the proposed scheme. Numerical results and computational complexity are shown in Section V and Section VI respectively. Finally, the conclusion is given in Section VII.
II. TRANSPORT BLOCK PROCESSING IN 5G NR
In this section, we focus on the processing of the transport block in 5G NR [6]. Let T be the transport block size. An A-bit CRC is attached at the end of the transport block, where A is equal to 24 if T > 3824 and 16 otherwise. The transport block, including the CRC, is divided into C equal size code blocks. C is equal to

$$C = \begin{cases} 1, & B \le \zeta \\ \lceil B / (\zeta - 24) \rceil, & \text{otherwise} \end{cases} \tag{1}$$

where B is equal to T + A and ζ is equal to 8448 for BG1 and 3840 for BG2. The size of each code block is equal to

$$K = \frac{B}{C} + \tau \tag{2}$$
where τ is equal to 0 for C = 1 and 24 otherwise. Note that the procedure of the transport block size determination guarantees that B is divisible by C [32] . Let R be the target code rate for the initial transmission. 
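The CRC attachment and segmentation rules above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `segment_transport_block` and its `base_graph` argument are hypothetical, and the rule used for C (a single code block when B ≤ ζ, otherwise ⌈B/(ζ − 24)⌉) is an assumption taken from the code block segmentation procedure of the 5G NR specification.

```python
def segment_transport_block(T, base_graph):
    """Return (A, B, C, K) for a transport block of T bits.

    base_graph is 1 for BG1 and 2 for BG2 (hypothetical argument name).
    """
    A = 24 if T > 3824 else 16           # transport block CRC length
    B = T + A                            # transport block size including CRC
    zeta = 8448 if base_graph == 1 else 3840
    if B <= zeta:
        C, tau = 1, 0                    # single code block, no code block CRC
    else:
        # Assumed segmentation rule: each block reserves 24 bits for its CRC.
        C = -(-B // (zeta - 24))         # integer ceiling of B / (zeta - 24)
        tau = 24
    K = B // C + tau                     # size of each code block
    return A, B, C, K
```

For the transport block size T = 295176 used in Section V, and assuming BG1, the sketch gives A = 24, B = 295200, C = 36 and K = 8224; for T = 6400 it gives a single code block with K = 6424.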
The lifting size Z is selected from the set of supported lifting sizes. This set includes all values of the form $j \times 2^{i}$ for $j \in \{2, 3, 5, 7, 9, 11, 13, 15\}$ and $i \in \{0, 1, 2, 3, 4, 5, 6, 7\}$ that range from 2 to 384, and it is categorized into eight subsets according to j. There is an exponent matrix $P_j$ associated with j for each base matrix. We assume that Z can be expressed as $m \times 2^{n}$. The full-base matrix $P_m$ is pruned to a sub-base matrix before being used. The details of the pruning are described in Section III. The parity check matrix H is constructed by lifting the sub-base matrix. That is, each non-negative element of the sub-base matrix is replaced by a Z × Z permutation matrix and each negative element is replaced by a Z × Z zero matrix. If the size of the sub-base matrix is R × C (R rows and C columns), the size of the parity check matrix H is RZ × CZ. The task of the encoding is to find a codeword c that satisfies the following equation

$$H c^{T} = 0 \tag{4}$$

where $(\cdot)^{T}$ denotes the transpose of the enclosed vector. Let the codeword c be

$$c = [s, p] \tag{5}$$

where $s = [s_0, s_1, \ldots, s_{CZ-RZ-1}]$ is the vector of systematic bits and $p = [p_0, p_1, \ldots, p_{RZ-1}]$ is the vector of parity bits. c is written into a circular buffer of length N. To reduce the complexity of the implementation, limited buffer rate matching (LBRM) is introduced in 5G NR. If LBRM is disabled, N is equal to $N_0$, where $N_0$ is equal to 66Z for BG1 and 50Z for BG2. If LBRM is enabled, N is equal to

$$N = \min\left(N_0, \left\lfloor \frac{T_{LBRM}}{C \cdot R_{LBRM}} \right\rfloor\right) \tag{6}$$
where $R_{LBRM} = 2/3$ and C is the number of code blocks. $T_{LBRM}$ is a function of the maximum number of layers $L_M$, the maximum modulation order $Q_M$ and the maximum number of physical resource blocks $P_M$ [6], [32]. The values of $T_{LBRM}$ are listed in Table 1. LBRM is usually enabled to reduce the size of the circular buffer and increase the throughput of the QC-LDPC codes. Let G be the number of code bits available for transmission of the transport block. Let L be the number of layers. Let Q be the modulation order. If code block groups (CBGs) are not supported, the rate matching output length E of the first C − mod(G/(LQ), C) code blocks is equal to

$$E_m = LQ \left\lfloor \frac{G}{LQC} \right\rfloor \tag{7}$$

and that of the last mod(G/(LQ), C) code blocks is equal to

$$LQ \left\lceil \frac{G}{LQC} \right\rceil \tag{8}$$
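Let $d = [d_0, d_1, \ldots, d_{N-1}]$ be the circular buffer. The initial values of the circular buffer are nulls. After writing the codeword c into the circular buffer, the values of d are as follows

$$d_i = \begin{cases} s_{i+2Z}, & 0 \le i < K - 2Z \\ p_{i-\xi}, & \xi \le i < \beta \\ \text{null}, & \text{otherwise} \end{cases} \tag{9}$$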
where $\beta = \min(N, \xi + RZ)$ and ξ is equal to 20Z for BG1 and 8Z for BG2. Then code bits $e = [e_0, e_1, \ldots, e_{E-1}]$ are read out from the starting position k in the circular buffer, skipping the code bits with a value of null. k is a function of the redundancy version rv and is given by Table 2. Next, bit interleaving is carried out with a row-column interleaver. The number of rows is equal to the modulation order. Code bits of each code block are written row-by-row into the interleaver and read out column-by-column. This increases the reliability of systematic bits and improves the performance of QC-LDPC codes [2], [33]. Let $f = [f_0, f_1, \ldots, f_{E-1}]$ be the output of the interleaver. e and f are related by

$$f_{i+jQ} = e_{iE/Q+j} \tag{10}$$
where i ∈ {0, 1, . . . , Q − 1} and j ∈ {0, 1, . . . , E/Q − 1}. Finally, code block concatenation collects the output of the rate matching.
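The rate matching steps of this section (output length per code block, bit selection from the circular buffer, and row-column interleaving) can be illustrated with the following minimal Python sketch. The function names are hypothetical, nulls are modelled as `None`, and the interleaver index mapping (position i + jQ of the output takes position iE/Q + j of the input) is assumed from the row-by-row write and column-by-column read described above.

```python
def rate_match_lengths(G, L, Q, C):
    """Output length E of each of the C code blocks (CBGs not supported)."""
    n_last = (G // (L * Q)) % C              # number of blocks using the ceiling
    low = L * Q * (G // (L * Q * C))         # L*Q*floor(G/(L*Q*C))
    high = L * Q * -(-G // (L * Q * C))      # L*Q*ceil(G/(L*Q*C))
    return [low if r < C - n_last else high for r in range(C)]


def select_bits(d, k, E):
    """Read E code bits from circular buffer d starting at k, skipping nulls."""
    if all(bit is None for bit in d):
        raise ValueError("circular buffer holds only nulls")
    N = len(d)
    out, i = [], 0
    while len(out) < E:
        bit = d[(k + i) % N]                 # wrap around the end of the buffer
        if bit is not None:
            out.append(bit)
        i += 1
    return out


def interleave(e, Q):
    """Row-column interleaver with Q rows: write row-by-row, read by column."""
    cols = len(e) // Q                       # E/Q columns, E assumed divisible by Q
    f = [None] * len(e)
    for j in range(cols):
        for i in range(Q):
            f[i + j * Q] = e[i * cols + j]
    return f
```

With the Section V parameters G = 314496, L = 1, Q = 8 and C = 36, every code block receives E = 8736 bits, and the lengths always sum to G.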
III. THE SCHEME OF CONSTRUCTING THE SUB-BASE MATRIX
In this section, we give the scheme of constructing the sub-base matrix for the encoder and the decoder. Before delving into the details, let us explore the general structure of the base matrix. The structure of BG2 is illustrated in Fig. 2. The structure of BG1 is similar. The base matrix consists of five submatrices: A, C, B, I and O. A and C are non-zero matrices. B has a dual-diagonal structure. I is an identity matrix. O is a zero matrix. A and B constitute the core part. C, I and O constitute the extension part. This structure is similar to that of the QC-LDPC codes introduced in [34]. The core part cannot be pruned [4]. The code bits corresponding to the first two columns of the base matrix are not transmitted. This puncturing improves the decoding performance [35]. The rows of C are designed to be orthogonal or quasi-orthogonal. Since layered decoding is widely used for QC-LDPC codes [36], [37], this design reduces the decoding latency and improves the system throughput [38], [39].
A. CONSTRUCTION OF THE SUB-BASE MATRIX FOR THE ENCODER
The output of the QC-LDPC encoder should consist of $[s_0, s_1, \ldots, s_{K-1}]$ and some parity bits. In the following, we derive the parity bits that need to be included. In order to gain some insight, let us first focus on the case where k = 0. That is, the starting point of the selection is at the beginning of the circular buffer. Since the first 2Z systematic bits are punctured, the number of systematic bits to be selected is
and the number of parity bits to be selected is
As a result, parity bits should consist of $[p_0, p_1, \ldots, p_{\delta-1}]$, where δ is equal to
If $N_p$ is larger than $N - \xi$, the selection of code bits wraps around to the beginning of the circular buffer. Now, let us not restrict the value of k. That is, the starting point of the selection may be anywhere in the circular buffer. The number of systematic bits to be selected is
If $k < \xi$, the starting point of the selection is in the middle of the systematic bits. If $k \ge \xi$, the starting point of the selection is in the middle of the parity bits.
The ending point of the selection in the circular buffer is
and $\delta_e$ is equal to
If $E > \lambda$, parity bits should consist of $[p_{\delta_s}, p_{\delta_s+1}, \ldots, p_{N-\xi-1}]$ and $[p_0, p_1, \ldots, p_{\delta_e-1}]$, where $\delta_s$ is equal to
The size of the systematic bits is at most $\xi + 2Z$. The last $\xi + 2Z - K$ systematic bits are filler bits and are not transmitted over the air. Since filler bits are set to zeros, the values of parity bits are not affected by the last $\xi + 2Z - K$ systematic bits. As a result, the columns corresponding to these systematic bits can be pruned.
If the full-base matrix $P_m$ is pruned to a sub-base matrix $P_m^{\dagger}$, the two matrices are related by

$$P_m^{\dagger} = P_m(\mu, \nu) \tag{21}$$

where µ is given by (22) and ν is equal to
The symbol ∪ means the union of two vectors [40]. The expression (21) means that $P_m^{\dagger}$ is obtained by selecting the intersection of the rows µ and the columns ν of $P_m$. Note that B cannot be pruned. Two examples of constructing the sub-base matrix for the encoder are shown in Fig. 3 and Fig. 4. From these figures, we see that $P_m^{\dagger}$ is only a small portion of $P_m$. Code bits that are not selected are not encoded. The throughput of the QC-LDPC encoder is therefore expected to increase.
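The row and column selection described above can be sketched as follows. This is an illustrative sketch only: `prune_base_matrix` and `count_circulants` are hypothetical helper names, the base matrix is stored as a list of rows with negative entries marking zero blocks, and the small matrix used in the usage note is a toy example rather than BG1 or BG2.

```python
def prune_base_matrix(full, rows, cols):
    """Select the intersection of the given rows and columns of a base matrix.

    full is a list of rows of exponents; negative entries mark zero blocks.
    rows and cols play the roles of the index vectors mu and nu.
    """
    return [[full[r][c] for c in cols] for r in rows]


def count_circulants(base):
    """Number of non-negative entries, i.e. circulant blocks to process."""
    return sum(1 for row in base for v in row if v >= 0)
```

For the toy matrix `[[0, 1, -1, 2], [3, -1, 4, 5], [-1, 6, 7, 8]]`, selecting rows `[0, 2]` and columns `[0, 1, 3]` yields `[[0, 1, 2], [-1, 6, 8]]`, which reduces the number of circulant blocks from 9 to 5. Note that the selected rows need not be the leading rows of the full-base matrix.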
B. CONSTRUCTION OF THE SUB-BASE MATRIX FOR THE DECODER
Hybrid automatic repeat request (HARQ) is widely used to improve the transmission efficiency [41] - [43] . Log likelihood ratios (LLRs) of the transport block received in error are stored in a buffer. The receiver generates the positive acknowledgement or the negative acknowledgement to drive the retransmission. When the retransmission is received, the decoding is performed by the buffered LLRs combined with the retransmission LLRs. As a result, the sub-base matrix used in the last transmission should be considered when constructing the sub-base matrix in the retransmission.
Let $\bar{\mu}$ be equal to ∅ for the initial transmission and to the rows selected in the previous transmission for a retransmission. Let $\bar{\nu}$ be equal to ∅ for the initial transmission and to the columns selected in the previous transmission for a retransmission. If the full-base matrix $P_m$ is pruned to a sub-base matrix $P_m^{\ddagger}$, the two matrices are related by

$$P_m^{\ddagger} = P_m(\mu \cup \bar{\mu}, \nu \cup \bar{\nu}) \tag{24}$$

where µ and ν are given in the preceding subsection. Two examples of constructing the sub-base matrix for the decoder are shown in Fig. 5 and Fig. 6. From these figures, we see that $P_m^{\ddagger}$ is only a small portion of $P_m$. Unnecessary systematic bits and parity bits are removed from the decoding process. The throughput of the QC-LDPC decoder is therefore expected to increase. In the initial transmission, the sizes of the sub-base matrices of the encoder and the decoder are the same. In a retransmission, the sub-base matrix of the decoder is larger than that of the encoder.
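The accumulation of rows and columns over HARQ transmissions can be sketched as a set union. The sketch below is illustrative; the function name `decoder_index_sets` and the example index values are hypothetical and are not taken from Fig. 5 or Fig. 6.

```python
def decoder_index_sets(mu, nu, prev_mu=(), prev_nu=()):
    """Rows and columns of the decoder sub-base matrix for one transmission.

    mu, nu: rows/columns needed by the current transmission alone.
    prev_mu, prev_nu: rows/columns already selected in the previous
    transmission (empty for the initial transmission).
    """
    rows = sorted(set(mu) | set(prev_mu))    # union of row indices
    cols = sorted(set(nu) | set(prev_nu))    # union of column indices
    return rows, cols


# Initial transmission: nothing is buffered yet.
rows0, cols0 = decoder_index_sets([0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5])
# Retransmission: the buffered LLRs keep the earlier rows and columns alive.
rows1, cols1 = decoder_index_sets([0, 1, 2, 3, 7], [0, 1, 2, 3, 8], rows0, cols0)
```

In this toy case the retransmission alone needs five rows, but the decoder works on six rows because row 4 from the initial transmission is retained.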
IV. THE ARCHITECTURE OF THE PROPOSED SCHEME
In this section, the architecture of the proposed scheme is described. In the processing of the transport block, we need to know the lifting size, the size of the circular buffer, the type of the base matrix, etc. These parameters are derived from the scheduling information [2], [6]. However, this derivation is not well suited to hardware. Usually, these parameters are calculated by the software and then passed to the hardware through the configuration [17], [29], [30], [44], [45]. Based on this configuration, the control signals are generated by the controller. The configuration needs to be as small as possible to reduce the memory. It is clear that the configuration includes the following fields: E, K, N, Z, the type of the base matrix, etc.
Compared to the traditional schemes, the proposed scheme mainly affects the QC-LDPC encoding module at the transmitter and the QC-LDPC decoding module at the receiver. For the sake of brevity, we only focus on these modules in this section.
A. ARCHITECTURE OF THE ENCODER
Before giving the proposed architecture, the number of bits needed to indicate the sub-base matrix $P_m^{\dagger}$ is derived. These bits should be added to the configuration. In the following, we show that 13 bits are needed to indicate the sub-base matrix.
can be represented by 1 bit, 6 bits and 6 bits respectively. From (22), we see that µ can be easily obtained from the fields that already exist in the configuration and $b^{\dagger}$. From (23), we see that ν can be easily obtained from the fields that already exist in the configuration and µ. As a result, only $b^{\dagger}$ needs to be added to the configuration to indicate the sub-base matrix $P_m^{\dagger}$.
There are many dual-diagonal encoder architectures that offer a variety of parallelism orders [7]-[10]. An architecture with higher parallelism provides higher throughput. However, this benefit comes at the cost of increased area. The architecture supports various base matrices specified by the protocol [6], [46], [47]. The controller of the architecture obtains the particular base matrix $P_m^{\dagger}$ used for encoding through the configuration and then generates the control signals based on $P_m^{\dagger}$, Z, etc. Then the codeword c is produced under the control of these signals. The existing architectures can easily support the proposed scheme by modifying the configuration and the controller. That is, $b^{\dagger}$, which indicates the sub-base matrix $P_m^{\dagger}$, is added to the configuration and the logic that generates the sub-base matrix $P_m^{\dagger}$ is added to the controller. Compared to the overall area of the architecture, the area of this logic is negligible. Since the logic that generates the sub-base matrix $P_m^{\dagger}$ is not on the critical path, this modification does not affect the operating frequency of the architecture.
For example, one proposed architecture is shown in Fig. 7. The difference between the proposed architecture and the reference architecture [7] is in the configuration and the controller, which are marked in red in the figure. The area of this architecture can be reduced by using fewer cyclic shifts at the expense of lower throughput. The proposed architecture is briefly discussed as follows. First, the intermediate variable t is obtained by accumulating the cyclic shifts of the systematic bits s in 4 clock cycles. t is stored in the core unit and s is stored in the systematic buffer. Next, the core parity bits $[p_0, p_1, \ldots, p_{4Z-1}]$ are generated by the core unit in 2 clock cycles and stored in parity buffer 1. Finally, the remaining parity bits $[p_{4Z}, p_{4Z+1}, \ldots, p_{RZ-1}]$ are obtained by accumulating the cyclic shifts of the systematic bits s and the core parity bits in (R − 4) clock cycles and stored in parity buffer 2. The proposed scheme needs a total of (R + 2) clock cycles.
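The cycle count above makes the benefit of pruning easy to quantify. The following sketch simply adds the three phases described in the text; the function name `encoder_cycles` is hypothetical, and the comparison in the usage note assumes the full BG1 base matrix with its 46 rows and an illustrative sub-base matrix with 6 rows.

```python
CORE_ROWS = 4  # rows of the core part, which cannot be pruned


def encoder_cycles(rows):
    """Clock cycles per codeword for a sub-base matrix with the given rows."""
    if rows < CORE_ROWS:
        raise ValueError("the core part cannot be pruned")
    accumulate = 4                   # accumulate cyclic shifts of s into t
    core = 2                         # core unit produces the first 4Z parity bits
    extension = rows - CORE_ROWS     # one cycle per remaining row
    return accumulate + core + extension
```

Under these assumptions, the full BG1 matrix needs 48 cycles per codeword while a 6-row sub-base matrix needs 8, which is consistent with the throughput being roughly inversely proportional to the number of rows.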
B. ARCHITECTURES OF THE DECODER
Before giving the proposed architecture, the number of bits needed to indicate the sub-base matrix $P_m^{\ddagger}$ is derived. These bits should be added to the configuration. In the following, we show that 42 bits are needed to indicate the sub-base matrix.
Let $b^{\ddagger}$ be the bit mask of length 42. The ith element of $b^{\ddagger}$ is equal to 1 if i is a member of $\mu \cup \bar{\mu} - [0, 1, 2, 3]$ and 0 otherwise, where the symbol − means the difference of two vectors [48]. From (22), we see that $\mu \cup \bar{\mu}$ can be easily obtained from $b^{\ddagger}$ since B cannot be pruned. From (23), we see that $\nu \cup \bar{\nu}$ can be easily obtained from the fields that already exist in the configuration and $\mu \cup \bar{\mu}$. As a result, only $b^{\ddagger}$ needs to be added to the configuration to indicate the sub-base matrix $P_m^{\ddagger}$.
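The bit mask can be illustrated with a short Python sketch. The helper names are hypothetical, and the index convention is an assumption made only for this example: mask position i − 4 is taken to represent row i of the full-base matrix, so that 42 bits cover rows 4 to 45 while the four core rows are always present.

```python
CORE_ROWS = 4    # rows 0..3 (core part) are always selected
MASK_LEN = 42    # mask length stated in the text


def rows_to_mask(rows):
    """Encode a set of selected rows as a 42-bit mask (assumed indexing)."""
    mask = [0] * MASK_LEN
    for r in rows:
        if r >= CORE_ROWS:               # core rows need no mask bit
            mask[r - CORE_ROWS] = 1      # assumption: position r - 4 <-> row r
    return mask


def mask_to_rows(mask):
    """Recover the selected rows, re-adding the unprunable core rows."""
    extension = [i + CORE_ROWS for i, bit in enumerate(mask) if bit]
    return list(range(CORE_ROWS)) + extension
```

A round trip such as `mask_to_rows(rows_to_mask([0, 1, 2, 3, 4, 5, 9]))` returns the original row list, which is all the controller needs to rebuild the decoder sub-base matrix.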
There are many block parallel architectures [16]-[18]. The architecture supports various base matrices specified by the protocol [46], [49], [50]. The controller of the architecture obtains the particular base matrix $P_m^{\ddagger}$ used for decoding through the configuration and then generates the control signals based on $P_m^{\ddagger}$, Z, etc. Then the systematic bits s are obtained under the control of these signals. The existing architectures can easily support the proposed scheme by modifying the configuration and the controller. That is, $b^{\ddagger}$, which indicates the sub-base matrix $P_m^{\ddagger}$, is added to the configuration and the logic that generates the sub-base matrix $P_m^{\ddagger}$ is added to the controller. Compared to the overall area of the architecture, the area of this logic is negligible. Since the logic that generates the sub-base matrix $P_m^{\ddagger}$ is not on the critical path, this modification does not affect the operating frequency of the architecture.
For example, one proposed architecture is shown in Fig. 8 . The difference between the proposed architecture and the reference architecture [17] is in the configuration and the controller, which are marked in red in the figure. The proposed architecture is briefly discussed as follows. First, the log likelihood ratio (LLR) buffer is initialized to LLR values of each bit and the extrinsic buffer is initialized to zeros. Then, at each time instant, the LLR values corresponding to a non-negative element of the sub-base matrix P m are read from the LLR buffer and the extrinsic information corresponding to the same non-negative element of the sub-base matrix P m is read from the extrinsic buffer. The difference between the cyclic shifted LLR values and the extrinsic information is fed to the min-sum unit. When all non-negative elements in a row of the sub-base matrix P m are processed, the min-sum unit and the adder output the updated extrinsic information and the updated LLR values respectively. The updated extrinsic information is written back to the extrinsic buffer. And after passing through the de-cyclic shifter, the updated LLR values are written back to the LLR buffer. The decoded bits can be obtained by the hard decision on the LLR values. This decoding process continues until the parity check passes or the maximum number of iterations is reached.
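The per-row processing described above can be sketched for a single check row. This is a simplified software sketch under stated assumptions, not the hardware of Fig. 8: the lifting size is taken as 1 so that no cyclic shifter is needed, the function name `update_layer` is hypothetical, and the normalization factor defaults to the 0.85 scaling factor quoted in Section V.

```python
def update_layer(llr, ext, cols, alpha=0.85):
    """One layered normalized min-sum update for a single check row.

    llr:  list of LLR values, updated in place.
    ext:  extrinsic values of this row, aligned with cols.
    cols: indices of the variable nodes connected to this row (at least 2).
    Returns the updated extrinsic values of the row.
    """
    # Remove the old extrinsic information before feeding the min-sum unit.
    q = [llr[c] - ext[i] for i, c in enumerate(cols)]
    new_ext = []
    for i in range(len(cols)):
        others = q[:i] + q[i + 1:]
        negatives = sum(1 for v in others if v < 0)
        sign = -1.0 if negatives % 2 else 1.0
        new_ext.append(alpha * sign * min(abs(v) for v in others))
    # The adder produces the updated LLR values.
    for i, c in enumerate(cols):
        llr[c] = q[i] + new_ext[i]
    return new_ext


def hard_decision(llr):
    """Decoded bits: 1 for a negative LLR, 0 otherwise."""
    return [1 if v < 0 else 0 for v in llr]
```

Because only the non-negative elements of the sub-base matrix trigger such an update, pruning rows and columns directly removes iterations of this loop from every decoding pass.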
V. NUMERICAL RESULTS
Numerical results are given in this section to compare the throughput and the block error rate (BLER) of the traditional schemes and the proposed scheme. The detailed configurations are listed below. There is a total of 14 orthogonal frequency division multiplexing (OFDM) symbols in the slot. The demodulation reference signal (DMRS) is located in the first 2 symbols of the slot. The physical uplink shared channel (PUSCH) is located in the last 12 symbols of the slot. The transport block is mapped to the resource elements (REs) in a frequency-first manner. Note that the front-loaded DMRS and the frequency-first mapping enable the transport block to be processed on the fly. It is not necessary to collect all the symbols in the slot before starting to decode the transport block. The number of physical resource blocks is $P_a$. The number of layers is 1. The transport block size T is determined according to the procedure in Section 5.1.3.2 of [32]. LBRM is applied and $T_{LBRM}$ is equal to 1277992. Let ψ be the index of the modulation and coding scheme, which is obtained from [32]. The target code rate R and the modulation order Q are derived from ψ. The sequence of redundancy versions is 0, 2, 3, 1. The parameters of the initial transmission and the retransmission are the same except for the redundancy version.
A. THROUGHPUT PERFORMANCE
In this subsection, the throughput of the proposed scheme and the traditional schemes are compared. In 5G NR, the throughput is usually defined as the number of systematic bits transport blocks with the lower code rate over the same radio resources in the same time duration [51] . The effective code rate decreases as the number of transmissions increases. The peak throughput requirement is difficult to meet in the retransmission [28] . The proposed scheme can alleviate this problem to a large extent.
B. BLER PERFORMANCE
In this subsection, the BLER performances of the proposed scheme and the traditional schemes are compared. The parameters of the simulation are listed as follows. The decoding algorithm is the layered normalized min-sum [56]. The scaling factor is equal to 0.85. The maximum number of iterations is equal to 12. ψ is equal to 27. From [32], we can derive that R is equal to 0.9258 and Q is equal to 8. The encoded bits are transmitted over the additive white Gaussian noise (AWGN) channel. Based on the scheduling information, we can derive that T is equal to 295176 and G is equal to 314496 when $P_a$ is equal to 273, and that T is equal to 6400 and G is equal to 6912 when $P_a$ is equal to 6.
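The values of G quoted above follow directly from the slot configuration of this section. The sketch below reproduces them; the function name `available_code_bits` is hypothetical, and it assumes 12 subcarriers per physical resource block, 12 data symbols per slot (the DMRS symbols carry no data) and no other overhead.

```python
def available_code_bits(num_prbs, num_layers, mod_order,
                        data_symbols=12, subcarriers_per_prb=12):
    """Number of code bits G available for one transport block in a slot."""
    resource_elements = num_prbs * subcarriers_per_prb * data_symbols
    return resource_elements * num_layers * mod_order


g_wide = available_code_bits(273, 1, 8)    # P_a = 273 -> 314496
g_narrow = available_code_bits(6, 1, 8)    # P_a = 6   -> 6912
```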
The BLER performances of the proposed scheme and the traditional schemes as a function of the signal-to-noise ratio (SNR) are illustrated in Fig. 17 and Fig. 18. From these figures, we see that the BLER performances of the proposed scheme and the traditional schemes are almost the same, and the difference between them is negligible. These simulation results verify the correctness of the proposed scheme for the encoder and the decoder.
VI. COMPUTATIONAL COMPLEXITY
The difference between the proposed scheme and the traditional schemes is mainly the sub-base matrix used for encoding and decoding. In this section, we consider the amount of computation required to obtain the sub-base matrix used for encoding and decoding. It is clear that the full-base matrix scheme can directly obtain the sub-base matrix used for encoding and decoding. To obtain the sub-base matrix used for the encoding, the leading sub-base matrix scheme requires up to 2 divisions, 4 comparisons and 6 additions and the proposed scheme requires up to 5 divisions, 8 comparisons and 17 additions. To obtain the sub-base matrix used for the decoding, the leading sub-base matrix scheme requires up to 2 divisions, 5 comparisons and 6 additions and the proposed scheme requires up to 5 divisions, 8 comparisons, 17 additions and 2 union operations. Note that the amount of computation required to obtain the sub-base matrix used for encoding and decoding is amortized over C code blocks and is negligible when C is large.
VII. CONCLUSION
In many applications, the code rate of the data transmission is larger than the mother code rate. In these cases, substantial throughput improvement can be achieved by pruning the full-base matrix of QC-LDPC codes to the desired size. In this paper, a scheme is developed that takes into account the difference between the pruning of the full-base matrix for the encoder and that for the decoder. A transport block with a higher code rate is encoded and decoded by using a smaller sub-base matrix instead of the full-base matrix. As a consequence, the computational efficiency and energy efficiency are improved. These features make the proposed scheme attractive for QC-LDPC codes in 5G NR.
HAO WU received the M.S. degree in communication and information systems from Tianjin University, in 2009. He is currently a Senior Engineer with ZTE Corporation, where he develops technologies to improve the performance of wireless broadband communication systems. He has authored a number of articles and holds a number of patents. His research interests include wireless communication systems, digital signal processing, and error control coding.
HUAYONG WANG received the M.S. degree in IC design engineering from Peking University, in 2007. He is currently a Senior Engineer with ZTE Corporation, Shenzhen, China. His research interests include wireless communications systems and VLSI design.
