In order to achieve high decoding throughput (hundreds of MBits/sec and above) for multiple code rates and moderate codeword lengths, several LDPC decoder solutions with different levels of processing parallelism are possible. Selection between these solutions is based on a threefold criterion: hardware complexity, decoding throughput, and error-correcting performance. In this work, we determine the multi-rate LDPC decoder architecture with the best tradeoff in terms of area cost, error-correcting performance, and decoding throughput. The prototype architecture of this decoder is implemented on an FPGA.
I. INTRODUCTION
Recent LDPC decoder designs are mostly based on block-structured Parity Check Matrices (PCMs) that allow structured architecture solutions with different levels of processing parallelism (data throughput). The tradeoff between throughput and area for partially parallel LDPC decoders that support block-structured regular PCMs has been investigated in [7, 9]. However, it remains a challenge to construct improved PCMs that preserve the architecture modularity and at the same time approach the performance of irregular random codes. The scalable partially parallel decoder from [10], which supports three code rates, is based on optimized structured regular and irregular PCMs but achieves only moderate decoding throughput. On the other hand, fully parallel decoders such as the one in [5] can be very fast, but their lack of flexibility and large area are major disadvantages in wireless systems that require support for multiple code rates and codeword sizes. The authors of [6] propose a novel optimization of LDPC codes to achieve area reduction in a fully parallel decoder; however, the codes are not generally suitable for use in flexible decoder architectures.
In this paper, we investigate the complex tradeoff between the level of architecture parallelism, area cost, and performance, and we target the design of flexible decoders. This analysis also requires special optimizations of block-structured PCMs. The specific target is a single decoder architecture that supports multiple code rates (1/2, 2/3, 3/4 and 5/6) and moderate codeword lengths (between 648 and 2592) with high data throughput (≈ 1 GBits/sec), compatible with the IEEE 802.11n standard specifications [3]. A wide range of decoder solutions with different levels of processing parallelism is analyzed. The decoder solution with the best tradeoff between throughput, area and performance is chosen for hardware implementation. A prototype architecture is implemented on a Xilinx FPGA.
II. LOW DENSITY PARITY-CHECK CODES
An LDPC code is a linear block code specified by a very sparse PCM, as shown in Fig. 1, with random placement of the nonzero entries. Each coded bit is represented by a PCM column (variable node), while each row of the PCM represents a parity-check equation (check node). Log-likelihood ratios (LLRs) are used to represent the reliability messages in order to simplify the arithmetic operations: the message $R_{mj}$ denotes a check node LLR sent from the check node m to the variable node j, the message $L(q_{mj})$ represents a variable node LLR sent from the variable node j to the check node m, and the messages $L(q_j)$ (j = 1, ..., n) represent the a posteriori probability ratios (APP messages) for all coded bits. The APP messages are initialized with the a priori (channel) reliability values of the coded bits j (∀j = 1, ..., n).
The iterative layered belief propagation (LBP) algorithm is a variation of the standard belief propagation (SBP) algorithm [7]. Its convergence rate is doubled because the scheduling of reliability messages is optimized [8]. As shown in Fig. 1, the typical block-structured PCM is composed of concatenated horizontal layers (component codes) and shifted identity sub-matrices. The belief propagation algorithm is repeated for each component code while the updated APP messages are passed between the sub-codes. For each variable node j inside the current horizontal layer, the messages $L(q_{mj})$ that correspond to all neighboring check nodes m are computed according to:

$$L(q_{mj}) = L(q_j) - R_{mj}. \qquad (1)$$

For each check node m, the messages $R_{mj}$, for all variable nodes j that participate in a particular parity-check equation m, are computed according to:

$$R_{mj} = \prod_{j' \in N(m) \setminus \{j\}} \operatorname{sign}\big(L(q_{mj'})\big) \; \Psi\Big[ \sum_{j' \in N(m) \setminus \{j\}} \Psi\big(L(q_{mj'})\big) \Big], \qquad (2)$$

where N(m) is the set of all variable nodes from the parity-check equation m, and $\Psi(x) = -\log\left[\tanh\left(|x|/2\right)\right]$. The a posteriori reliability messages in the current horizontal layer are updated according to:

$$L(q_j) = L(q_{mj}) + R_{mj}. \qquad (3)$$
If all parity-check equations are satisfied or the pre-determined maximum number of iterations is reached, the decoding algorithm stops. Otherwise, the algorithm repeats from (1) for the next horizontal layer.
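The layered update described above can be sketched in a few lines of Python. This is a minimal floating-point illustration, not the fixed-point hardware datapath: the function names `psi` and `decode_layer`, the dictionary-based message storage, and the numerical clamping inside `psi` are assumptions made for the example. It assumes the usual layered formulation in which a variable node message is the APP value minus the old check node message, and the APP value is restored by adding the new check node message.

```python
import math

def psi(x):
    """Psi(x) = -log(tanh(|x|/2)); the clamp (an assumption) avoids log(0)."""
    x = min(max(abs(x), 1e-9), 30.0)
    return -math.log(math.tanh(x / 2.0))

def decode_layer(app, r, rows):
    """Process one horizontal layer, updating app and r in place.

    app  : list of APP LLRs L(q_j), one per coded bit
    r    : dict (m, j) -> check node message R_mj (missing entries count as 0)
    rows : dict m -> list of variable node indices N(m) in this layer
    """
    for m, neighbours in rows.items():
        # Reading stage: variable node messages L(q_mj) = L(q_j) - R_mj
        q = {j: app[j] - r.get((m, j), 0.0) for j in neighbours}
        for j in neighbours:
            others = [q[k] for k in neighbours if k != j]
            # Processing stage: sign product and Psi of the sum of Psi values
            sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
            r[(m, j)] = sign * psi(sum(psi(v) for v in others))
            # Writing stage: updated APP message L(q_j) = L(q_mj) + R_mj
            app[j] = q[j] + r[(m, j)]

# Toy example: one parity check over three bits (values are illustrative only)
app = [2.0, 3.0, -0.5]
r = {}
decode_layer(app, r, {0: [0, 1, 2]})
```

In the toy example the weakly negative third LLR is pulled positive by the two reliable neighbours, which is the expected behaviour of a single parity check.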
III. LDPC DECODERS WITH DIFFERENT PROCESSING PARALLELISM
The processing parallelism consists of two parts: the parallelism in decoding of concatenated codes (horizontal layers of the PCM), and the parallelism in reading/writing of reliability messages from/to memory modules. Several decoder solutions with different levels of processing parallelism are considered, targeting hundreds of MBits/sec data throughput:
1. L1RW1: Decoder with full decoding parallelism per one horizontal layer (L1), reading/writing of messages from one nonzero sub-matrix in each clock cycle (RW1).
2. L1RW2: Decoder with full decoding parallelism per one horizontal layer, reading/writing of messages from two consecutive nonzero sub-matrices in each clock cycle (RW2).
3. L3RW1: Decoder with the pipelining of three consecutive horizontal layers (L3), reading/writing of messages from one nonzero sub-matrix in each clock cycle.
4. L3RW2: Decoder with the pipelining of three consecutive horizontal layers, reading/writing of messages from two consecutive nonzero sub-matrices in each clock cycle.
5. L6RW2: Decoder with double pipelining (the pipelining of six consecutive horizontal layers, L6), reading/writing of messages from two consecutive nonzero sub-matrices in each clock cycle.
6. FULL: Fully parallel decoder with the simultaneous execution of all layers and simultaneous reading/writing of all reliability messages.
Block-structured irregular codes have been proposed for the IEEE 802.11n standard [3] since they can approach the error-correcting performance of excellent random codes. Inspired by these codes, we propose a systematic construction of irregular block-structured PCMs that allows two times higher memory access throughput while preserving the error-correcting performance. An example of these novel block-structured PCMs is shown in Fig. 1.
The main constraint in the proposed PCM design is that any two consecutive nonzero sub-matrices from the same horizontal layer belong to an odd and an even block-column of the PCM (see Fig. 1). This property is essential for achieving more parallel memory access, since the APP messages from two sub-matrices can be simultaneously read/written from/to two independent APP memory modules. The PCMs with 24 block-columns provide better performance than the PCMs with 18, 48, 72 or 96 block-columns. This is due to their near-optimal profile and small number of short cycles, while they still support a parallelism level sufficiently high for high-throughput decoder implementations. Therefore, all analyzed decoder architectures, including the fully parallel one, support our novel PCMs with 24 block-columns.
For the solutions that access one sub-matrix per clock cycle (RW1), a single APP RAM block is utilized, where each memory location contains the APP messages from one block-column. Therefore, one full sub-matrix of the PCM can be accessed in each clock cycle. The width of the valid memory content depends on the codeword length (size of the sub-matrix), and it is up to 108 messages for the largest supported codeword length of 2592. Meanwhile, each check node memory location also contains check node messages from one full nonzero sub-matrix.
The architecture-oriented constraint of equally distributed odd and even nonzero block-columns in every horizontal layer allows for the read/write of two sub-matrices per clock cycle (used in solutions 2, 4 and 5). As shown in Fig. 2 (part labelled RW2), the APP memory is partitioned into two independent modules. Each location in one APP memory module contains the APP messages that correspond to one of the twelve odd block-columns. The other APP memory module contains the APP messages from the even block-columns. Meanwhile, each check node memory location contains check node messages that correspond to two consecutive nonzero sub-matrices from the same horizontal layer.
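The odd/even constraint and the resulting two-module access pattern can be illustrated with a short Python sketch. The function names and the 1-based block-column numbering are assumptions of this example, and the constraint is interpreted strictly (every adjacent pair of nonzero sub-matrices in a layer alternates between odd and even block-columns); the paper's actual construction procedure is not reproduced here.

```python
def satisfies_odd_even_constraint(layer_columns):
    """Check that consecutive nonzero sub-matrices alternate odd/even columns.

    layer_columns: 1-based block-column indices of the nonzero sub-matrices
    of one horizontal layer, in processing order.
    """
    return all((a % 2) != (b % 2)
               for a, b in zip(layer_columns, layer_columns[1:]))

def rw2_schedule(layer_columns):
    """Group a layer into per-cycle accesses (odd module, even module).

    With an odd number of sub-matrices, the last cycle uses one module only
    and the other entry is None.
    """
    if not satisfies_odd_even_constraint(layer_columns):
        raise ValueError("layer violates the odd/even block-column constraint")
    cycles = []
    for i in range(0, len(layer_columns), 2):
        group = layer_columns[i:i + 2]
        odd = next((c for c in group if c % 2 == 1), None)
        even = next((c for c in group if c % 2 == 0), None)
        cycles.append((odd, even))
    return cycles

# Hypothetical layer with five nonzero sub-matrices
schedule = rw2_schedule([1, 4, 7, 12, 13])
```

For the hypothetical layer above, three clock cycles are needed instead of five, and the last cycle touches only the odd APP memory module.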
The entire content of the check node memory location is loaded/stored in a single clock cycle. It is important to observe that the dual-port RAMs can be replaced with single-port RAMs if there is no pipelining of the horizontal layers. However, the memory organization does not depend on the number of memory ports. By construction, all rows inside a single horizontal layer are independent and can be processed in parallel without any performance loss. Every horizontal layer is decoded in three stages: a memory reading stage, a processing stage, and a memory writing stage, corresponding to Eqs. (1), (2) and (3), respectively. The decoding parallelism is increased if three consecutive horizontal layers are pipelined (label L3), as visualized in Fig. 3. The three-stage pipelining introduces a small performance loss because APP messages from the same block-columns but from different horizontal layers overlap in time.
The decoding parallelism is doubled if the pipelining of six consecutive layers is employed (label L6), as shown in Fig. 4. In order to avoid reading/writing collisions (simultaneous reading and writing of APP messages from identical block-columns but from two different horizontal layers), it is necessary to postpone the start of every second layer by one clock cycle (see Fig. 4).
IV. THROUGHPUT-AREA-PERFORMANCE TRADEOFF ANALYSIS
The proposed decoder solutions with different processing parallelism levels presented in the previous section are compared. It is assumed that each solution (except the L6RW2 architecture) supports multiple code rates (from 1/2 to 5/6) and codeword lengths (648, 1296, 1944, and 2592) with the novel block-structured PCMs. The L6RW2 solution only supports code rates up to 3/4, since the PCM with 24 block-columns has only four horizontal layers for the code rate of 5/6, which is insufficient to fill the pipeline (so L6RW2 is at a disadvantage). Figure 5 shows the tradeoff for all analyzed decoders: decoding parallelism and memory access parallelism increase from left to right (from L1RW1 to FULL, as in Section III). The tradeoff is defined as the ratio between decoding throughput and decoder core area, and it is therefore expressed in MBits/sec/mm². The decoding throughput is based on the average number of decoding iterations required to achieve a frame-error rate (FER) of 10⁻⁴ for the largest supported code rate and codeword length. We assume 10⁶ simulated codeword transmissions, a maximum of 15 iterations, and a clock frequency of 200 MHz.
The hardware complexity is measured by the area in mm², assuming a 0.18 µm 6-metal CMOS process. The total area is computed as the sum of the arithmetic area and the memory area. The arithmetic part of each decoder is represented as a number of standard CMOS logic gates (also shown in Fig. 5), where it is assumed that the typical TSMC design density for 0.18 µm technology is 100K logic gates per mm² [1]. The memory size is given as the number of bits required for the storage of APP and check messages (SRAMs) and supported PCMs (ROMs). The corresponding memory area is computed by assuming a single-port SRAM cell size of 4.65 µm², which is a typical size for 0.18 µm technology [1]. Dual-port SRAM blocks are required for the implementation of the L3RW1, L3RW2 and L6RW2 decoders because the memory reading and writing stages are pipelined; the memory cell area is typically doubled for the same design rule [2]. The fully parallel solution does not have any inherent flexibility: the arithmetic logic for all 16 rate-size combinations is necessary, which significantly increases the arithmetic area. However, the same set of latches can be utilized for the storage of reliability messages for all rate-size combinations.
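The area model described above reduces to simple arithmetic, sketched here in Python. The constants come from the text (100K gates per mm², 4.65 µm² per single-port SRAM cell, doubled for dual-port cells); the function names and the gate and bit counts in the example are hypothetical and are not the values from Fig. 5. As a simplification, the sketch applies the same cell size to every stored bit, including ROM bits.

```python
GATES_PER_MM2 = 100_000   # typical 0.18 um design density quoted in the text
SRAM_CELL_UM2 = 4.65      # single-port SRAM cell size quoted in the text

def decoder_area_mm2(logic_gates, memory_bits, dual_port=False):
    """Total core area = arithmetic area + memory area, in mm^2."""
    arithmetic = logic_gates / GATES_PER_MM2
    cell = SRAM_CELL_UM2 * (2 if dual_port else 1)  # dual-port cell ~2x area
    memory = memory_bits * cell / 1e6               # um^2 -> mm^2
    return arithmetic + memory

def throughput_per_area(throughput_mbps, area_mm2):
    """Tradeoff metric in MBits/sec/mm^2."""
    return throughput_mbps / area_mm2

# Hypothetical decoder: 200K gates and 100K memory bits
single = decoder_area_mm2(200_000, 100_000)                 # about 2.465 mm^2
dual = decoder_area_mm2(200_000, 100_000, dual_port=True)   # about 2.93 mm^2
```

With these hypothetical numbers, moving to dual-port memory adds about 0.47 mm², so a pipelined design must gain more than roughly 19% in throughput before its throughput/area ratio improves.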
If pipelining of only three layers is employed, then the decoding throughput is increased by approximately three times while the arithmetic area is increased only marginally. On the other hand, the memory area is doubled, since both mirror APP and mirror check node memories are then required. Overall, three-stage pipelining significantly improves the throughput/area ratio (see Fig. 5, L1RW1 vs. L3RW1 architecture, and L1RW2 vs. L3RW2 architecture). If the memory access parallelism is doubled, then the decoding throughput is directly increased by more than 50%, while the arithmetic area is doubled and the memory size remains the same (see Fig. 2). Overall, if only the memory access parallelism is increased, then the throughput/area ratio is only slightly improved (see Fig. 5, L1RW1 vs. L1RW2 architecture, and L3RW1 vs. L3RW2 architecture).
A further increase of decoding parallelism (L6RW2 and FULL solutions) does not improve the tradeoff ratio, since the throughput improvements are smaller than the necessary increase in area. A similar effect is expected if the memory access parallelism is further increased: if four sub-matrices were accessed per clock cycle (not shown here, since it requires a special redesign of the PCMs), the arithmetic area would double. However, the decoding throughput would improve by only about 25%, since the processing latency would start to dominate over the memory access latency.
It can be observed from Fig. 5 that the best throughput per area ratio is achieved by the three-stage pipelining approach with memory access that allows reading/writing of two sub-matrices per clock cycle (the L3RW2 solution). At the same time, the performance loss due to the pipelining is acceptable (about 0.1 dB for an FER around 10⁻⁴; see Fig. 6, for which 10⁶ codeword transmissions are simulated). In order to avoid the performance loss, the double pipelining and the SBP approaches would require additional decoding iterations, which would further decrease the decoding throughput.
V. THREE-STAGE PIPELINED LDPC DECODER WITH DOUBLE-PARALLEL MEMORY ACCESS
An LDPC decoder with three-stage pipelining and a memory organization that supports access of two sub-matrices during a single clock cycle (L3RW2) is chosen for hardware implementation.
The support for different code rates and codeword lengths implies the usage of multiple PCMs. Information about each supported PCM is stored in the corresponding ROM module, where a single memory location contains the block-column position of one nonzero sub-matrix and the associated shift value. The block-column position represents the reading/writing address of the appropriate APP memory module. In order to avoid the permutation of APP messages during the writing stage, the relative shift values (the difference between two consecutive shift values of the same block-column) are stored instead of the original shift values.
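The idea of storing relative shift values can be illustrated with a small Python sketch. It assumes that each block-column's APP messages start in natural order (shift 0) and remain in memory in the rotated order of the last layer that used them, so that only the difference to the next shift has to be applied on reading. The function names, the dictionary representation of a layer, and the toy sub-matrix size are hypothetical; how the actual decoder handles the first iteration or the wrap-around between iterations is not specified in the text.

```python
def rotate(values, shift):
    """Cyclic left rotation of a list by `shift` positions."""
    shift %= len(values)
    return values[shift:] + values[:shift]

def relative_shifts(layers, z):
    """Convert absolute shift values to relative ones.

    layers: list of dicts {block_column: absolute_shift}, one per layer
    z     : sub-matrix size (shifts are taken modulo z)
    """
    previous = {}   # last absolute shift applied to each block-column
    result = []
    for layer in layers:
        rel = {}
        for col, shift in layer.items():
            rel[col] = (shift - previous.get(col, 0)) % z
            previous[col] = shift
        result.append(rel)
    return result

# Toy PCM: three layers, sub-matrix size z = 4 (illustrative values only)
layers = [{1: 1, 2: 3}, {1: 3, 3: 2}, {1: 0, 2: 1}]
rel = relative_shifts(layers, z=4)

# Block-column 1: rotating by the relative shifts in sequence reproduces
# the ordering required by each layer's absolute shift.
memory = [10, 11, 12, 13]
state = memory
for layer_abs, layer_rel in zip(layers, rel):
    state = rotate(state, layer_rel[1])
    assert state == rotate(memory, layer_abs[1])
```

Because the messages are written back in the order in which they were read, no inverse permutation is needed in the writing stage under this assumption.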
The support of multiple code rates is ensured by the control unit. The number of nonzero sub-matrices per horizontal layer varies with the code rate, which affects the latencies of the reading/writing stages as well as the processing latency within the DFUs. The control logic provides the synchronization between the pipelined stages. It also handles the exceptions that occur if the number of nonzero sub-matrices in a horizontal layer is odd. In that case, only one block-column per clock cycle is read/written from/to the odd or the even APP memory module. The full content of the corresponding check node memory location (width of two sub-matrices) is loaded even though the second half of it is not valid. The control unit then disables the appropriate part of the arithmetic logic inside the DFUs.
Figure 9 shows the FER performance of the implemented flexible decoder for the case of a 2/3-rate code and a codeword length of 1296. An arithmetic precision of seven bits is chosen for the representation of reliability messages (two's complement with one bit for the fractional part) as well as for all arithmetic operations. A prototype architecture has been implemented using Xilinx System Generator and targeted to a Xilinx Virtex4-XC4VFX60 FPGA. Table 1 shows the utilization statistics. Based on the XST synthesis tool report and full place and route statistics, a maximum clock frequency of 130 MHz can be achieved, which determines the maximum average throughput of approximately 1.1 GBits/sec.
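The seven-bit message format mentioned above (two's complement, one fractional bit) can be made concrete with a short Python sketch. The format implies a resolution of 0.5 and a representable range from -32.0 to 31.5. The function name `quantize`, the round-half-up rule, and the saturation on overflow are assumptions of this example; the text does not state which rounding or overflow behaviour the hardware uses.

```python
import math

TOTAL_BITS = 7   # word length from the text
FRAC_BITS = 1    # one fractional bit from the text

def quantize(llr):
    """Map a real LLR to the nearest representable fixed-point value."""
    scale = 1 << FRAC_BITS                 # 2 codes per unit -> step of 0.5
    lo = -(1 << (TOTAL_BITS - 1))          # -64 as an integer code
    hi = (1 << (TOTAL_BITS - 1)) - 1       # +63 as an integer code
    code = math.floor(llr * scale + 0.5)   # round half up (assumption)
    code = max(lo, min(hi, code))          # saturate on overflow (assumption)
    return code / scale

samples = [quantize(x) for x in (0.2, 0.74, -1.3, 100.0, -100.0)]
```

Large-magnitude LLRs clip at 31.5 and -32.0 in this sketch, which is the kind of saturation that bounds the dynamic range of the APP messages in a fixed-point decoder.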
VI. CONCLUSION
In this paper, the multi-rate decoder architecture that represents the best tradeoff in terms of throughput, area, and performance is found and implemented on an FPGA. An identical tradeoff analysis can be applied for a wide range of code rates and codeword lengths. We believe that the pipelining of multiple horizontal layers combined with a sufficiently parallel memory organization is a general tradeoff solution that can be applied to other block-structured LDPC codes.
