Abstract: In low-density parity check (LDPC) decoder implementations, the architecture of the node processing units (NPUs) has a significant impact both on the hardware resource requirements and on the processing throughput. Additionally, some NPU architectures impose limitations on the decoder's support for intra- or inter-standard LDPC code flexibility at run-time. In this brief, we present a generalised algorithmic method of constructing NPUs that support run-time flexibility while maintaining a low hardware resource requirement and high maximum operating frequency. FPGA-based synthesis results demonstrate that the proposed architecture offers significantly improved hardware efficiency when compared to two commonly employed alternatives.
I. INTRODUCTION
LOW-DENSITY Parity Check (LDPC) codes [1] exhibit error correction performance very close to the Shannon limit, whilst supporting a high degree of parallel processing. As a result, they are increasingly being adopted in modern communications standards, including IEEE 802.11n [2], IEEE 802.16e [3], and eMBB data in 3GPP 5G New Radio (NR) [4].
As described in [5], an LDPC code is defined by a sparse M × N Parity-Check Matrix (PCM) H, which may also be depicted graphically by a bipartite factor graph. Here, edges connect Check Nodes (CNs) and Variable Nodes (VNs), as dictated by the positions of the non-zero elements of H. The number of edges d_c/d_v connected to each CN/VN is known as that node's degree. LDPC codes may be decoded by iteratively generating and exchanging messages between the VNs and CNs, via the edges of the factor graph [6]. The specific operation of each node, the format of the messages, and the order in which they are exchanged can vary between differing decoding algorithms and schedules. In hardware LDPC decoder implementations, the functions of the nodes are performed by Node Processing Units (NPUs). The internal construction of an NPU may be thought of as a logical structure comprising one or more 2-input 1-output subnode blocks [7], [8]. These subnodes perform the specific calculations of the NPU, and so their design can vary between different LDPC decoding algorithms. In this brief, we consider the arrangement of subnodes within an NPU, in a manner that is independent of the internal subnode operation.
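As a minimal sketch of the degree definitions above, the following Python fragment counts the non-zero elements in each row and column of a PCM. The matrix used here is a hypothetical toy example chosen only for illustration; it is not taken from any of the cited standards.

```python
# Hypothetical toy parity-check matrix (illustration only, not from a standard).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def node_degrees(pcm):
    """Return (d_c per check node, d_v per variable node) for a PCM.

    Each row of the PCM is a check node and each column is a variable node;
    a node's degree is the number of non-zero entries in its row or column.
    """
    d_c = [sum(1 for x in row if x != 0) for row in pcm]
    d_v = [sum(1 for x in col if x != 0) for col in zip(*pcm)]
    return d_c, d_v

d_c, d_v = node_degrees(H)
D_C, D_V = max(d_c), max(d_v)  # maximum degrees a flexible NPU would need to support
print(d_c, d_v, D_C, D_V)
```

The spread of values in `d_c` and `d_v` is what determines how many different numbers of inputs and outputs a time-multiplexed NPU has to handle.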
The majority of published LDPC decoders adopt a partially-parallel architecture [5], [9], in which a number of parallel Check Node Processing Units (CNPUs) and Variable Node Processing Units (VNPUs) are time-multiplexed between different CNs and VNs respectively, using flexible inter-node routing. This approach benefits from a flexible trade-off between hardware resource usage and processing throughput, and the flexible routing facilitates support for multiple different PCMs [5]. However, partially-parallel decoders require each NPU to be able to perform the function of multiple CNs or VNs. In irregular LDPC code decoders, or in flexible decoders that support PCMs having different node degrees, NPUs must therefore be able to process varying numbers of inputs and outputs. This requirement for run-time flexibility can drastically complicate the design of an NPU, increasing its hardware resource requirements and/or critical path length.
Against this background, in this brief we present a hardware-efficient flexible NPU architecture, with general applicability to many different NPU types and LDPC decoding algorithms. In Section II, we provide a discussion of other NPU architectures, and their suitability for flexible LDPC decoder implementations. The proposed algorithm and the resultant architecture are then presented in Section III, followed by implementation characteristics and concluding remarks in Sections IV and V, respectively.
II. FLEXIBLE NPU ARCHITECTURES
Each NPU has a number I of inputs and outputs, which respectively accept and provide the messages that are exchanged via the connected edges of the factor graph. For a VNPU, these include one extra input and one extra output beyond the node degree (I = d_v + 1 for a VN of degree d_v): the extra input corresponds to the intrinsic channel information, while the extra output provides a posteriori information used to make a decision about the corresponding LDPC encoded bit. In all cases, each output depends on all inputs other than the corresponding input, with the exception of the a posteriori output, which depends on all inputs, including the corresponding intrinsic input [5]. As mentioned in Section I, flexible NPUs must have the ability to represent multiple different VNs or CNs at run-time, varying the number of inputs and outputs used according to the degree of the current VN or CN being represented.
Some previously published block-parallel LDPC decoders [9], [10] have solved this problem by employing serial NPUs, comprising only a single subnode. Here, each node having i inputs is processed using i − 1 clock cycles. Whilst this may be efficient in terms of hardware resource usage, it results in very low processing throughputs. Alternatively, parallel flexible NPUs may be implemented having the maximum number of inputs and outputs, namely I = D_C for CNPUs or I = D_V + 1 for VNPUs, where D_C and D_V refer to the maximum values of d_c and d_v within the set of supported PCMs, respectively. When used for processing nodes having lower degrees, the unused inputs may be effectively disabled by supplying them with a "null" value [11]-[13]. For the example of VNPUs processing messages in the form of binary fixed-point Logarithmic-Likelihood Ratios (LLRs), each subnode performs the addition of two LLRs. Here, providing any input with a 0-valued LLR will nullify any effect it may have on the subsequent calculations. However, a null value does not always exist, as is the case for the NPUs of stochastic LDPC decoders [7], [14], in which the messages exchanged between nodes are represented by bit-serial Bernoulli sequences [7]. Furthermore, the use of null values results in inefficient energy usage, since every subnode is activated during every NPU activation, regardless of whether it contributes towards the decoding process.
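The null-value approach for LLR-based VNPUs can be illustrated with a short behavioural sketch. The function names below are hypothetical, the arithmetic is plain integer addition with no fixed-point saturation, and the outputs are obtained by subtraction purely as a reference model rather than as a description of any particular hardware structure.

```python
def vnpu_outputs(llrs_in):
    """Behavioural reference for a VNPU operating on LLRs.

    llrs_in[0] is the intrinsic channel LLR. out[0] is the a posteriori LLR
    (sum of all inputs); out[k] for k > 0 is the sum of all inputs except
    llrs_in[k]. Saturation is deliberately ignored in this sketch.
    """
    total = sum(llrs_in)
    return [total] + [total - x for x in llrs_in[1:]]

def vnpu_fixed_width(active_llrs, max_inputs):
    """Model a fixed-width VNPU with max_inputs inputs.

    Unused inputs are fed the null value 0, which has no effect on a sum,
    and only the outputs matching the active inputs are returned.
    """
    padded = list(active_llrs) + [0] * (max_inputs - len(active_llrs))
    return vnpu_outputs(padded)[:len(active_llrs)]

narrow = vnpu_outputs([3, -1, 2])
wide = vnpu_fixed_width([3, -1, 2], max_inputs=8)
print(narrow, wide)
```

Both calls give the same result, but the fixed-width model still evaluates every addition for all eight inputs, which mirrors the energy inefficiency noted above.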
Many previous LDPC decoder implementations have presented NPU architectures that are compatible with null inputs. Of these, the most efficient in terms of both hardware usage and propagation delay uses a binary tree of subnodes to compute the combination of all I inputs, then performs the inverse subnode operation for each output in order to remove the contribution made by the corresponding input [11]. This architecture requires 2 × I − 1 subnodes and has a critical path of ⌈log2(I)⌉ + 1 subnode delays, but is only possible when the subnode operation is invertible. Alternative architectures must therefore be employed when the subnode operation is not invertible, such as the min operation of CNPUs performing the Min-Sum Algorithm (MSA) [13]. One rudimentary option, henceforth referred to as a multi-tree architecture, is to use I parallel binary trees to calculate the combination of each of the I possible sets of I − 1 inputs [11]. This option provides the shortest possible propagation delay of ⌈log2(I − 1)⌉ subnode delays. In a naïve implementation, the hardware requirement of this architecture is I × (I − 2) subnodes, although it is possible to reduce this by reusing early subnode outputs between trees, as described in [11]. As an alternative, the forwards-backwards architecture [15] maximally reuses subnode outputs in order to minimise the required number of subnodes to 3 × I − 6 [8]. However, the forwards-backwards algorithm necessitates a long critical path of I − 2 subnode delays, limiting the maximum operating frequency of practical decoder implementations.
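To make the subnode count of the forwards-backwards approach concrete, the following Python sketch evaluates it in software for a CNPU-style node, in which every output excludes its own input. It is one common way of arranging the forward and backward chains, written under the assumption of at least three inputs; the function name and the operation counter are illustrative and are not taken from [15].

```python
def forwards_backwards(inputs, op=min):
    """Leave-one-out combination using forward and backward chains.

    Returns (outputs, subnode_count), where outputs[k] combines every input
    except inputs[k], and subnode_count is the number of 2-input operations.
    """
    n = len(inputs)
    assert n >= 3, "sketch assumes at least three inputs"
    count = 0

    # Forward chain: fwd[k] combines inputs[0..k], for k = 0 .. n-2.
    fwd = [inputs[0]]
    for k in range(1, n - 1):
        fwd.append(op(fwd[-1], inputs[k]))
        count += 1

    # Backward chain: bwd[k] combines inputs[k..n-1], for k = n-1 .. 1.
    bwd = [None] * n
    bwd[n - 1] = inputs[n - 1]
    for k in range(n - 2, 0, -1):
        bwd[k] = op(inputs[k], bwd[k + 1])
        count += 1

    # Outputs: the two end outputs reuse chain results directly; each
    # interior output merges one forward and one backward partial result.
    outs = [bwd[1]]
    for k in range(1, n - 1):
        outs.append(op(fwd[k - 1], bwd[k + 1]))
        count += 1
    outs.append(fwd[n - 2])
    return outs, count

outs, count = forwards_backwards([5, 2, 7, 3, 9])
print(outs, count)
```

For five inputs this uses nine 2-input operations, matching 3 × I − 6, and the forward chain alone is I − 2 operations long, which is the source of the long critical path.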
A compromise between these two options is the dual-tree architecture of [7], [8]; however, this has not previously been generalised to support an arbitrary and run-time flexible number I of inputs and outputs. Accordingly, in Section III we present an algorithmic method for constructing dual-tree NPUs. These flexible NPUs employ multiplexers to optionally bypass subnodes having unused inputs, ensuring only active inputs reach the NPU outputs. Doing so grants run-time flexibility without the use of null inputs, and allows energy savings to be made using clock- or power-gating on unused subnodes. We also propose a method of reducing the number of multiplexers required, depending on the numbers of inputs that are required by a particular supported set of one or more LDPC PCMs. This results in an NPU architecture that supports flexibility for any number of inputs and outputs, regardless of the function performed by the internal subnodes, and without requiring excessive hardware resources or propagation delay.
III. THE DUAL-TREE NPU ARCHITECTURE
In this section, we propose the generalised algorithm of Fig. 1 for creating an NPU having the proposed flexible dual-tree structure. Fig. 2 then exemplifies an NPU having the proposed structure for the special case where the maximum number of inputs I is a power of 2, namely I = 16. Later, Fig. 3 will exemplify a more general case having I = 13.
In Fig. 2, each shaded square represents a single 2-input 1-output subnode, while the NPU inputs in_0 to in_15 are depicted on the left, with the corresponding outputs out_0 to out_15 vertically flipped on the right. When the number of active inputs and outputs i is less than I, data is provided to only the top i inputs, in_0 to in_{i−1}, and the corresponding results are presented on the bottom i outputs, out_0 to out_{i−1}. Note that Fig. 2 represents a D_V = 15 VNPU, with channel input in_0 and decision output out_0. Alternatively, the structure in Fig. 2 could represent a CNPU with a maximum degree of D_C = 16, by changing the second input to the bottom subnode in the right-most stage to input in_1 rather than T_1^0.
The overall structure of the proposed architecture may be described as follows. The dual-tree topology comprises two main sections, summing and combining, each of which contains S = ⌈log2(I)⌉ − 1 stages.
As mentioned previously, the flexibility of the proposed architecture is facilitated by multiplexers placed at the outputs of key subnodes throughout the NPU. More specifically, during the summing section, multiplexers are placed at each intermediate signal T_s^t where t > 0; during the combining section, they are placed at each signal where t is even, and additionally at signals where t = 1. These bypass multiplexers optionally allow the corresponding intermediate signal to replicate a value from an earlier connected stage, rather than using the subnode output. This option is exercised when the subnode result would otherwise depend upon an unused NPU input. More specifically, each multiplexer has a bypass threshold b, and is used for bypassing the corresponding subnode when the number of active inputs and outputs satisfies i ≤ b. The value of b for each multiplexer is calculated in Fig. 1, and presented in Fig. 2.
In this way, the proposed NPU architecture offers full flexibility for any number of inputs i ≤ I, by using only an additional 2 × (I − 2) multiplexers.
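The bypass principle can be illustrated with a simplified software model. The sketch below applies a bypass condition of the form i ≤ b to a single combining tree, which is a deliberate simplification and not the full dual-tree structure generated by the algorithm of Fig. 1. The function name, the way the tree is split, and the choice of b as the first input index of each right-hand subtree are all assumptions made for this illustration.

```python
def flexible_tree(inputs, i, op=min):
    """Combine only inputs[0:i] using a tree sized for len(inputs) inputs.

    Each internal subnode merges a left (lower-index) and a right
    (higher-index) subtree. Its bypass multiplexer forwards the left result
    unchanged whenever i <= b, where b is the index of the first input in
    the right subtree, so unused inputs never influence the result.
    """
    assert 1 <= i <= len(inputs)

    def combine(lo, hi):
        # Combines inputs[lo:hi]; callers guarantee that inputs[lo] is active.
        if hi - lo == 1:
            return inputs[lo]
        mid = (lo + hi + 1) // 2
        left = combine(lo, mid)
        b = mid  # bypass threshold for this subnode (assumption of this sketch)
        if i <= b:
            return left  # multiplexer selects the earlier signal
        return op(left, combine(mid, hi))

    return combine(0, len(inputs))

active = [6, 4, 9, 1, 8]
# Unused inputs hold arbitrary values (0 here); they would corrupt a min
# operation if they were not bypassed.
padded = active + [0] * 11
result = flexible_tree(padded, i=len(active))
print(result)
```

Because bypassed subnodes are never evaluated, no null value is needed, which is the property that also permits clock- or power-gating of unused subnodes in hardware.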
The algorithm of Fig. 1 can also extend the architecture of Fig. 2 to less regular cases having any maximum number of inputs I. In this generalised case, 3 × I − 6 subnodes are required by the proposed architecture, which is identical to the requirement of the forwards-backwards architecture [15]. However, the proposed architecture has a shorter propagation delay, whilst also facilitating the flexibility mentioned previously. More specifically, the number of subnode delays D of the forwards-backwards architecture equals I − 2, whereas for the dual-tree architecture it is calculated according to
Furthermore, motivated by the observation that full flexibility is not required for most applications, the algorithm of Fig. 1 can also optimise the architecture of Fig. 2 to only include the multiplexers that are actually required to support a reduced set of NPU degrees. For example, Fig. 3 depicts a VNPU which can provide the functionality of any variable node within the 12 PCMs of the IEEE 802.11n LDPC code [2]. More specifically, this I = 13 VNPU supports any number of inputs and outputs in the set i_sup ∈ {3, 4, 5, 7, 8, 9, 12, 13}. It may be observed that the NPU of Fig. 3 has a similar structure to that of Fig. 2, albeit with the removal of 9 subnodes across both summing and combining sections. Furthermore, this NPU supports a reduced level of run-time flexibility, demonstrated by the fact that i_sup contains only a subset of i_full ∈ {0, 1, 2, ..., I}. Accordingly, Fig. 3 can omit the multiplexers at signals T_1^5, out_10, out_1, and out_0 in Fig. 2, in addition to those removed automatically alongside their corresponding subnodes.
The set of multiplexers that can be removed in this way is identified using the relevance parameter r of Fig. 1. For each multiplexer, this parameter represents the minimum value of i for which the bypassed result provided by that multiplexer (i.e., when i ≤ b) will be utilised in later stages.
If i_sup does not include any values of i for which r < i ≤ b, then the bypassed result will never be used, and hence the multiplexer is not required. In this case, the signal that would otherwise be provided by the multiplexer can be connected directly to the output of the corresponding subnode.
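The multiplexer pruning rule stated above can be expressed as a one-line check. In the sketch below, the supported set is the IEEE 802.11n VNPU set quoted earlier, whereas the (r, b) pairs are hypothetical values chosen only to exercise the rule; they are not the values computed by the algorithm of Fig. 1.

```python
def mux_required(r, b, i_sup):
    """Return True if a bypass multiplexer must be kept.

    The multiplexer is needed only if some supported number of inputs i
    satisfies r < i <= b, i.e. the bypassed result is both selected and used.
    """
    return any(r < i <= b for i in i_sup)

# Supported numbers of inputs and outputs for the I = 13 VNPU example.
I_SUP_VN = {3, 4, 5, 7, 8, 9, 12, 13}

# Hypothetical (r, b) pairs, for illustration only.
keep_a = mux_required(r=5, b=8, i_sup=I_SUP_VN)   # i = 7 and i = 8 are supported
keep_b = mux_required(r=9, b=11, i_sup=I_SUP_VN)  # neither 10 nor 11 is supported
print(keep_a, keep_b)
```

When the check returns False, the signal can be wired straight to the subnode output, as described above.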
IV. IMPLEMENTATION RESULTS
As discussed in Section I, the proposed dual-tree architecture may be used to implement a variety of NPUs, as determined by the function of its 2-input 1-output subnodes. Accordingly, the specific implementation characteristics of NPUs having the proposed dual-tree architecture depend on the characteristics of these subnodes. In order to facilitate a comparison and discussion in this section, we consider NPUs having subnodes which calculate the min(x, y) of two 4-bit unsigned inputs x and y, as used within fixed-point CNPUs performing the MSA [13]. Since this function is non-invertible, the single binary tree structure mentioned in Section II is not viable. This motivates the use of one of the alternative general-purpose architectures discussed earlier, namely the proposed dual-tree, multi-tree [11] and forwards-backwards (FwdBwd) [15] architectures. Note that of these, only the dual-tree architecture has inherent flexibility over the number of active inputs and outputs. However, in order to facilitate a fair comparison, the other architectures may be implemented to provide a "null" value of +7 on unused inputs to negate their effects, as described in [11].
Fig. 4 presents the hardware requirements and maximum operating frequencies (f_max) of CNPUs supporting a range of maximum numbers of inputs and outputs, 4 ≤ I ≤ 30, where each node flexibly supports all numbers of inputs and outputs i ≤ I, giving i_sup = i_full. Here, the hardware requirements and maximum operating frequencies are based on synthesis for an Altera Stratix IV EP4SGX530 FPGA, with hardware resource usage characterised using the Equivalent Logic Block (ELB) metric presented in [5]. In addition to the general results in Fig. 4, Table I presents the measured characteristics of CNPUs that have been designed to support any CN within IEEE 802.11n [2], having I = D_C = 22, and supporting each number of inputs and outputs in the set i_sup ∈ {7, 8, 11, 14, 15, 19, 20, 21, 22}. Table I also presents the hardware efficiency (f_max / ELBs) of each CNPU.
As may be expected, the long critical path of the forwards-backwards architecture causes it to have a significantly lower f_max than the alternative architectures. However, Table I and Fig. 4 also indicate that its hardware resource requirement is slightly higher than that of the proposed dual-tree topology, despite requiring an equal number of subnodes. This may be explained by the distribution of the bypass multiplexers throughout the proposed dual-tree NPUs, allowing synthesis optimisations to combine their functions with unused inputs/outputs of the logic elements used by the subnodes. By contrast, the multiplexers that select between an input LLR and a null LLR in the forwards-backwards architecture cannot be similarly distributed and optimised.
Fig. 4. Comparison of fixed-point CNPUs constructed using Dual-Tree topology vs. two alternatives, for a range of input numbers I.
The large number of subnodes required by the multi-tree architecture causes it to have a significantly higher hardware usage than the alternative architectures, as shown in Table I and Fig. 4. However, it can also be seen that the multi-tree architecture has a lower f_max than that of the proposed dual-tree architecture, despite having fewer subnode delays in its critical path. Again, this may be explained by the physical limitations of FPGA synthesis, in which a higher hardware resource usage creates longer routing paths, increasing the propagation delay.
Overall, it may be seen that the proposed dual-tree architecture offers a lower hardware resource requirement and a higher maximum operating frequency than both of the general-purpose alternatives. This is reflected in the significantly higher hardware efficiency seen in Table I, which is 122% higher than that of the multi-tree architecture, and 248% higher than that of the forwards-backwards architecture.
V. CONCLUSION
In this brief, we have presented an algorithmic generalisation for the construction of optimised dual-tree NPUs, which flexibly support any number of inputs and outputs. This general-purpose architecture minimises the number of 2-input 1-output subnodes required and achieves a short critical path, making it suitable for use in any NPU in which the subnode function is non-invertible. Additionally, the proposed architecture uses bypass multiplexers to achieve full flexibility over the supported number of inputs and outputs, facilitating its use in decoder architectures in which the number of NPU inputs and outputs may vary at run-time. We have also presented a method for reducing the hardware requirements of these bypass multiplexers, according to the specific set of supported numbers of inputs and outputs. We have demonstrated that the proposed architecture offers better than the best of both worlds compared to two alternative general-purpose architectures. More specifically, the proposed architecture improves on the hardware resource usage of the forwards-backwards architecture and improves on the maximum supported clock frequency of the multi-tree architecture in FPGA implementations. In the case of an IEEE 802.11n LDPC decoder, the proposed architecture offers a 122% improvement in hardware efficiency compared to the better of these two benchmarks.
