ABSTRACT
INTRODUCTION
Low-density parity-check (LDPC) codes have received major attention in the research community in recent years because of their excellent error-correction performance. Several architectures have been proposed for LDPC decoders [2] [4]. Most of these architectures support only one specific LDPC code or a family of codes with a fixed block length and code rate. At Rice University, a flexible high-throughput LDPC decoder that supports a family of codes with different block lengths and code rates has been proposed in [5]. In this work we propose a joint LDPC encoder and decoder based on the IEEE 802.11n draft specification [1]. The encoder/decoder pair supports twelve combinations of block lengths (648, 1296, and 1944 bits) and code rates (1/2, 2/3, 3/4, and 5/6). The parity-check matrices of these codes are irregular and block-structured, as defined in [1]. The layered belief propagation algorithm [7] is used in our design because it converges twice as fast as the standard belief propagation algorithm, resulting in twice the throughput. A prototype of the encoder/decoder architecture is implemented in Verilog HDL, tested on an FPGA, and synthesized for a 0.13 µm ASIC process. The logic synthesis report shows better area efficiency and throughput than the currently reported IEEE 802.11n LDPC decoders [5] [6].
EFFICIENT LDPC ENCODER
An LDPC code is a linear block code specified by a very sparse parity-check matrix (PCM). LDPC codes are usually represented by a bipartite graph in which a variable node corresponds to a coded bit (a PCM column) and a check node corresponds to a parity-check equation (a PCM row). An edge connects a check node and a variable node if the corresponding PCM entry is a one. In general, an (n, k) LDPC code has k information bits and n coded bits, with code rate r = k/n. The parity-check matrix H has dimension (n − k) × n, and it defines a set of equations
$$H \cdot [s \;\; p]^T = [H_1 \;\; H_2] \cdot [s \;\; p]^T = 0 \qquad (1)$$
where s is the k information bits and p is the n − k parity bits. From (1), we have
$$p^T = H_2^{-1} H_1 s^T \qquad (2)$$
High encoding complexity arises from the high density of $H_2^{-1}$ [8]. However, for the check matrix proposed in IEEE 802.11n, $H_2$ has a simple deterministic structure, and encoding can be performed recursively. As shown in Fig. 1, $H_2$ is an $m \times m$ array of $S \times S$ sub-matrices, where $S$ is 27, 54, or 81 depending on the code length. The block column $[h_0, h_1, \ldots, h_{m-1}]^T$ of $H_2$ satisfies two conditions: 1) each $h_i$ is either an $S \times S$ zero matrix or an $S \times S$ shifted identity matrix, and 2) $\sum_{i=0}^{m-1} h_i = I_{S \times S} \pmod 2$. Based on IEEE 802.11n [1], $H_1$ consists of an $m \times b$ array of $S \times S$ sub-matrices, each of which is either a zero matrix or a shifted identity matrix. Given a block of information bits $s$, if we decompose $s$ into a $1 \times b$ array of $1 \times S$ sub-vectors and decompose the parity bits $p$ into a $1 \times m$ array of $1 \times S$ sub-vectors, then from (2) and the structure of $H_2$ we can prove that
Since $H_1$ consists of either zero or shifted identity sub-matrices, H
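Because every non-zero sub-matrix is a shifted identity, multiplying one block row of the matrix by the information blocks reduces to cyclic shifts and XORs. The following is a minimal sketch of that idea, not the encoder described in this paper. It assumes that a shift value k means the identity with its columns cyclically shifted right by k (so row r has its single 1 in column (r + k) mod S); the function names, the toy block size S = 4, and the shift values are hypothetical.

```python
def cyclic_shift(block, shift):
    """Multiply a shifted identity sub-matrix by a length-S block of bits.

    Assumption: row r of the sub-matrix has its single 1 in column
    (r + shift) mod S, so output bit r is input bit (r + shift) mod S.
    """
    S = len(block)
    return [block[(r + shift) % S] for r in range(S)]


def block_row_product(shifts, s_blocks):
    """XOR together the shifted information blocks of one block row.

    shifts[j] is the shift of sub-matrix j, or None for an all-zero block.
    """
    acc = [0] * len(s_blocks[0])
    for shift, block in zip(shifts, s_blocks):
        if shift is None:
            continue  # zero sub-matrix contributes nothing
        rotated = cyclic_shift(block, shift)
        acc = [a ^ b for a, b in zip(acc, rotated)]
    return acc


def dense_block_row(shifts, S):
    """Expand the same block row into an explicit S x (len(shifts) * S) bit matrix."""
    rows = []
    for r in range(S):
        row = []
        for shift in shifts:
            for c in range(S):
                row.append(1 if shift is not None and c == (r + shift) % S else 0)
        rows.append(row)
    return rows


# Toy example (hypothetical values): three sub-matrices of size 4.
S = 4
shifts = [1, None, 3]
s_blocks = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]

fast = block_row_product(shifts, s_blocks)

# Reference result from a plain matrix-vector product modulo 2.
flat = [bit for block in s_blocks for bit in block]
slow = [sum(h * x for h, x in zip(row, flat)) % 2
        for row in dense_block_row(shifts, S)]
```

Both computations give the same bits, but the shift-based version never stores or multiplies the sparse sub-matrices, which is why this structure maps well onto barrel shifters and XOR trees in hardware.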
LDPC SOFT DECODING ALGORITHM
The decoder architecture proposed in this paper uses the iterative layered belief propagation (LBP) algorithm as defined in [7]. Fig. 3 shows a block-structured parity-check matrix, which is a D × B array of S × S sub-matrices; each sub-matrix is either a zero matrix or a shifted identity matrix with a random shift value. Within every layer, each column has at most one 1, so there are no data dependencies between the variable node messages of a layer, and the messages flow in tandem only between adjacent layers. The block size S is 27, 54, or 81, corresponding to code lengths of 648, 1296, and 1944 bits respectively [1]. Let $L(q_{mj})$ denote the variable node log-likelihood ratio (LLR) message sent from variable node j to check node m; then
$$L(q_{mj}) = L(q_j) - R_{mj} \qquad (4)$$
in which $R_{mj}$ is the check node LLR message sent from check node m to variable node j, and $L(q_j)$ (j = 1, ..., N) represent the a posteriori probability ratios (APP) for all variable nodes. The APP messages are initialized with the channel reliability values of the coded bits. N(m) is the set of all variable nodes connected to check node m. To simplify the hardware implementation of the nonlinear function ψ(x), the check node update in (5) is replaced with the modified min-sum approximation [3]. With this approximation, the check node messages in the mth row of the PCM are updated as
$$R_{mj} = \prod_{j' \in N(m) \setminus j} \mathrm{sign}\big(L(q_{mj'})\big) \cdot \max\Big( \min_{j' \in N(m) \setminus j} \big|L(q_{mj'})\big| - \beta,\; 0 \Big)$$
where β is a positive constant correction offset. With a properly chosen β, the modified min-sum approximation exhibits only about 0.1 dB of performance degradation [3]. Hard decisions can be made after every horizontal layer based on the sign of $L(q_j)$. If all parity-check equations are satisfied or the predetermined maximum number of iterations is reached, the decoding algorithm stops. Otherwise, the algorithm repeats from Eq. (4).
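The layered update with an offset min-sum check node rule can be sketched in software as follows. This is an illustrative model under stated assumptions, not the hardware decoder of this paper: each PCM row is treated as its own layer (the paper groups S rows per layer), a positive LLR is assumed to mean bit 0, every check row is assumed to have at least two ones, the parity check is evaluated once per iteration rather than after every layer, and the function name, the default β = 0.5, and the small test matrix are hypothetical.

```python
def layered_offset_min_sum(H, channel_llr, beta=0.5, max_iter=10):
    """Layered decoding with an offset min-sum check node update.

    H is a list of 0/1 rows, channel_llr holds one LLR per coded bit.
    Returns (hard_decisions, iterations_used).
    """
    n = len(channel_llr)
    app = list(channel_llr)               # L(q_j), initialised from the channel
    R = [[0.0] * n for _ in H]            # R_mj, check-to-variable messages
    hard = [1 if v < 0 else 0 for v in app]

    for iteration in range(1, max_iter + 1):
        for m, row in enumerate(H):       # one layer = one row in this sketch
            cols = [j for j, h in enumerate(row) if h]
            # Variable-to-check messages: remove the old check contribution.
            q = [app[j] - R[m][j] for j in cols]
            mags = [abs(v) for v in q]
            sign_all = 1.0
            for v in q:
                if v < 0:
                    sign_all = -sign_all
            # Smallest and second-smallest magnitudes in this row.
            i_min = min(range(len(q)), key=mags.__getitem__)
            min1 = mags[i_min]
            min2 = min(mag for i, mag in enumerate(mags) if i != i_min)
            for i, j in enumerate(cols):
                other_min = min2 if i == i_min else min1
                # Product of the signs of all *other* messages in the row.
                sign_others = sign_all * (-1.0 if q[i] < 0 else 1.0)
                R[m][j] = sign_others * max(other_min - beta, 0.0)
                app[j] = q[i] + R[m][j]   # updated APP value

        hard = [1 if v < 0 else 0 for v in app]
        if all(sum(h * b for h, b in zip(row, hard)) % 2 == 0 for row in H):
            return hard, iteration        # all parity checks satisfied
    return hard, max_iter


# Hypothetical toy example: a small (7, 4) parity-check matrix, all-zero
# codeword, with bit 3 received unreliably with the wrong sign.
H_toy = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
llr_toy = [2.0, 2.0, 2.0, -0.5, 2.0, 2.0, 2.0]
decoded, iterations = layered_offset_min_sum(H_toy, llr_toy)
```

Because each layer immediately reuses the APP values updated by the previous layer, the wrong-sign bit in this toy case is corrected within the first pass over the rows.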
Pipelined decoder architecture
In this section, we propose a pipelined decoding algorithm as well as a hardware implementation. Fig. 5 shows two-stage pipelined decoding. Instead of waiting for layer i to update all the APP node messages, the next layer i + 1 can start to read APP node messages shortly after layer i begins to write new APP node messages. Because the non-zero sub-matrices are located at random positions in each layer, a pipeline hazard may occur; Fig. 6 shows an example. The cross signs in Fig. 6(a) indicate non-zero sub-matrices. As shown in Fig. 6(b), at clock cycle 6, layer i + 1 tries to read memory location 3, which will not be updated until clock cycle 8 (we assume an SRAM write latency of 1 clock cycle). To avoid this memory conflict, two memory-read stalls are inserted at clock cycles 6 and 7. A hardware implementation of this pipelined DFU is shown in Fig. 7. A local FIFO is needed to buffer the next layer's data while the current layer's data is being processed. The proposed pipelined decoding can increase the overall throughput by a factor of about 1.5 to 2, depending on the code rate.
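The stall insertion described above can be modelled with a few lines of code. This is a simplified scheduling sketch, not the DFU of Fig. 7: it assumes one read per cycle, a known write cycle for every location that layer i updates, and that a value becomes readable `write_latency` cycles after its write is issued. The function name and the write cycles in the example are hypothetical; only the read attempt at cycle 6, the availability at cycle 8, and the two resulting stalls come from the text.

```python
def schedule_reads(write_cycle, read_order, start_cycle, write_latency=1):
    """Schedule the APP memory reads of layer i+1 behind the writes of layer i.

    write_cycle maps a memory location to the cycle in which layer i issues
    its write; locations that layer i does not update are absent.
    Returns the (cycle, location) read schedule and the number of stall cycles.
    """
    cycle = start_cycle
    stalls = 0
    schedule = []
    for location in read_order:
        if location in write_cycle:
            # First cycle in which the updated value can be read.
            ready = write_cycle[location] + write_latency
            if cycle < ready:
                stalls += ready - cycle   # insert memory-read stalls
                cycle = ready
        schedule.append((cycle, location))
        cycle += 1
    return schedule, stalls


# Hypothetical timing: layer i writes locations 0, 2 and 3 in cycles 5, 6 and 7,
# while layer i+1 wants to read locations 1, 3 and 5 starting at cycle 5.
layer_i_writes = {0: 5, 2: 6, 3: 7}
schedule, stalls = schedule_reads(layer_i_writes, read_order=[1, 3, 5], start_cycle=5)
```

In this example the read of location 3 is attempted at cycle 6 but is delayed to cycle 8, which costs two stall cycles, the same situation the text describes for Fig. 6(b).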
DECODER ASIC IMPLEMENTATION AND PERFORMANCE ANALYSIS
The LDPC decoder architecture was implemented in Verilog HDL and synthesized with a TSMC 0.13 µm standard cell library. Table 1 shows a summary of the synthesis results. Complexity is measured in equivalent gates for logic and in bits for memories. The non-pipelined implementation requires 90 K logic gates plus 77,760 bits of RAM, while the pipelined implementation requires 195 K logic gates plus 77,760 bits of RAM.
A Verilog RTL simulation model was used to measure average throughput versus SNR. For instance, at a rather low SNR of 1.0 dB, the pipelined decoder achieves 150 Mbps, while at an SNR of 2.2 dB it achieves about 1 Gbps.
LDPC ENCODING/DECODING TESTING ON WIRELESS TESTBED
To explore LDPC encoding and decoding performance, we have been conducting over-the-air OFDM experiments using the Rice Wireless Open Access Research Platform (WARP). As shown in Fig. 8, the Rice WARP platform (http://warp.rice.edu) is reconfigurable and consists of FPGA baseband processors along with multiple attached 2.4 GHz radio subsystems, which enables quick prototyping of wireless communication algorithms. The proposed LDPC encoder and decoder are currently being tested on the WARP platform.
CONCLUSION
We have presented a high-throughput parallel LDPC decoder and an efficient LDPC encoder based on the IEEE 802.11n standard. The encoder/decoder is based on block-structured irregular codes and can be extended to support other code lengths and code rates. The LDPC encoder and decoder are being implemented on an FPGA and tested on our wireless testbed. Future applications include real-time LDPC encoding and decoding for MIMO OFDM.
