Abstract-This paper describes a multi-mode LDPC decoder that supports the 19 block lengths and 6 code rates of the Quasi-Cyclic LDPC codes for the Mobile WiMAX system. To implement the 114 operation modes efficiently, several design optimizations are applied, including a block-serial layered decoding scheme, a memory reduction technique based on the min-sum decoding algorithm, and a novel method for generating the cyclic shift values of the parity-check matrix. Decoding performance and optimal hardware parameters are analyzed through fixed-point simulations. The designed LDPC decoder is verified by FPGA implementation and synthesized with a 0.18-μm CMOS cell library. It has 380,000 gates and 52,992 bits of RAM, and the estimated throughput is about 164~222 Mbps at 56 MHz@1.8 V.
I. INTRODUCTION
With the increasing demand for high-data-rate wireless and multimedia applications, forward error correction (FEC) coding schemes have become more important for reliable communication. In recent years, low-density parity-check (LDPC) codes, which were first proposed by R. Gallager [1] in the early 1960s and rediscovered by MacKay and Neal [2], have received a lot of attention due to their remarkable error-correction capability near the Shannon limit. Many current and next-generation communication standards, such as WLAN (IEEE 802.11n) [3], Mobile WiMAX (IEEE 802.16e) [4], DVB-S2 [5] and 10GBase-T (IEEE 802.3an), have adopted or are considering the use of LDPC codes. There has been much recent research on LDPC codes and decoders, including code construction, decoding algorithms and architectures, and multi-mode and/or multi-standard designs [6] [7] [8] [9].
LDPC codes can be efficiently decoded by the belief propagation (BP) algorithm, also known as the sum-product (SP) algorithm [10, 11]. It was originally formulated as two-phase decoding, which is based on iterative exchanges of messages between check nodes and variable nodes on the Tanner graph. Although the SP algorithm offers high-precision message passing and excellent error-correcting performance, its hardware implementation is known to be inefficient because its complex computations require a large amount of hardware. To reduce the complexity of the check node operation of the BP algorithm, the min-sum (MS) algorithm [12, 13] was introduced. Since the MS algorithm has lower computational complexity with little sacrifice in performance, it is preferred for hardware implementation of LDPC decoders. Both the SP and MS algorithms can also be used with layered decoding, whose decoding schedule follows the sequence of layers. Layered decoding converges faster and has lower hardware complexity than two-phase decoding.
LDPC decoder architectures can be classified into three categories: fully parallel, serial and partly parallel. LDPC decoder design involves a tradeoff among error-correction performance, hardware complexity and decoding throughput. Although various decoding algorithms and architectures for LDPC codes have been proposed, many challenges remain, including flexible architectures for multi-mode and multi-standard operation, memory requirements, and hardware complexity for high throughput.
In this paper, we propose a design technique to reduce the size of the check node memory of an MS-based layered LDPC decoder. In addition, a decoder architecture supporting 114 operating modes for the 19 code lengths and 6 code rates of Mobile WiMAX LDPC codes and a prototype design are presented.
This paper is organized as follows: Section II briefly introduces the Quasi-Cyclic (QC)-LDPC code for the Mobile WiMAX system and the layered decoding algorithm. In Section III, we describe our LDPC decoder architecture and memory reduction technique with comparisons. The implementation results of our LDPC decoder are given in Section IV.
II. LDPC CODES AND DECODING ALGORITHMS

LDPC Code Structure of Mobile WiMAX System [4]
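The QC-LDPC code of the Mobile WiMAX system is defined by a parity-check matrix (PCM) H of size $M \times N$, where N is the code length and M is the number of parity-check bits in the code. The PCM H is expanded from a binary base matrix $H_b$ of size $m_b \times n_b$, where $m_b = M/z_f$ and $n_b = N/z_f$, with expansion factor $z_f$. Each LDPC code in the set of LDPC codes for Mobile WiMAX is defined by a PCM H as
$$H = \begin{bmatrix} P_{0,0} & P_{0,1} & \cdots & P_{0,n_b-1} \\ P_{1,0} & P_{1,1} & \cdots & P_{1,n_b-1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{m_b-1,0} & P_{m_b-1,1} & \cdots & P_{m_b-1,n_b-1} \end{bmatrix} \qquad (1)$$
where $P_{i,j}$ is either one of a set of $z_f \times z_f$ circularly right-shifted identity matrices (i.e., permutation matrices) or a $z_f \times z_f$ zero matrix.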
The PCM can be effectively represented by a bipartite graph called a Tanner graph, as shown in Fig. 1. It has two types of nodes: check nodes (CNs) and variable nodes (VNs). The VNs represent the bits of a codeword and the CNs implement the parity-check equations. Each CN is connected to the VNs at the positions where the corresponding row of the PCM contains a "1".
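The expansion of a base matrix into the full PCM described in this subsection can be illustrated with a minimal sketch. It assumes that an entry of -1 in the base matrix marks a zero block and that a non-negative entry gives the right-shift of the identity block; the function names and the toy base matrix are hypothetical and are not taken from the standard.

```python
import numpy as np

def shifted_identity(z, shift):
    """z x z identity matrix circularly right-shifted by `shift` columns."""
    return np.roll(np.eye(z, dtype=np.uint8), shift, axis=1)

def expand_base_matrix(base, z):
    """Expand a base matrix of shift values into a binary PCM.

    Assumed convention: -1 marks a z x z zero block, and an entry
    s >= 0 marks an identity block shifted right by s positions.
    """
    mb, nb = len(base), len(base[0])
    H = np.zeros((mb * z, nb * z), dtype=np.uint8)
    for i, row in enumerate(base):
        for j, s in enumerate(row):
            if s >= 0:
                H[i * z:(i + 1) * z, j * z:(j + 1) * z] = shifted_identity(z, s % z)
    return H

# Toy base matrix for illustration only (not a WiMAX base matrix).
toy_base = [[0, 1, -1, 2],
            [-1, 3, 0, 1]]
H = expand_base_matrix(toy_base, z=4)
print(H.shape)
```

Each row of H then corresponds to one CN and each column to one VN, so the row and column weights of H give the node degrees of the Tanner graph.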
Decoding Algorithms
LDPC codes can be effectively decoded using a message passing algorithm that iteratively exchanges soft information between the two sides of the Tanner graph. The exchanged messages are the log-likelihood ratios (LLRs) of the received bits in the codeword, defined as in Eq. (3), where x and y are the transmitted codeword and the received codeword, respectively. During the decoding process, the LLRs, which measure the reliability of the received bits based on channel observation, are propagated and updated between VNs and CNs until the decoded bits satisfy the parity-check constraints.
The standard BP algorithm can be decomposed into two phases, CN processing and VN processing, as follows:
i) CN processing: at the q-th iteration, each CN receives messages from the connected VNs, and computes the CN updating message as in Eq. (4).
ii) VN processing: at the q-th iteration, each VN receives messages from the connected CNs, and computes the VN updating message as in Eq. (6).
In addition, each VN computes a refined estimate of the transmitted bit, the a posteriori probability (APP), by adding the extrinsic information from all connected CNs to the channel value $F_i$, as in Eq. (7). The sign of $z_i^q$ can be interpreted as the hard decision on the received bit.
The arithmetic complexity of the SP algorithm can be greatly reduced by MS algorithm using an approximation
With the MS algorithm, the CN processing defined in Eq. (4) can be written as
The nonlinear function used in the SP check-node update, which is typically implemented as a look-up table (LUT) in hardware, can be replaced by finding the minimum magnitude among the incoming VN messages, so the computational complexity is significantly reduced. Additionally, it is well known that the MS algorithm is generally less sensitive to quantization errors than the SP algorithm, and hence a shorter word-length can be used to reduce logic complexity.
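To make the approximation concrete, the following sketch compares an exact SP check-node update with its MS counterpart. It uses the standard tanh form of the SP rule, which is assumed here to be what Eq. (4) expresses; the function names and input values are hypothetical.

```python
import math

def cn_update_sp(v):
    """Exact sum-product check-node outputs (tanh rule) for incoming LLRs v."""
    out = []
    for i in range(len(v)):
        prod = 1.0
        for j, x in enumerate(v):
            if j != i:
                prod *= math.tanh(x / 2.0)
        out.append(2.0 * math.atanh(prod))
    return out

def cn_update_ms(v):
    """Min-sum approximation: sign product times minimum magnitude of the others."""
    out = []
    for i in range(len(v)):
        others = v[:i] + v[i + 1:]
        sign = 1.0
        for x in others:
            if x < 0:
                sign = -sign
        out.append(sign * min(abs(x) for x in others))
    return out

msgs = [1.2, -0.4, 2.5, -3.1]   # illustrative variable-to-check LLRs
sp = cn_update_sp(msgs)
ms = cn_update_ms(msgs)
for s, m in zip(sp, ms):
    print(f"SP {s:+.3f}   MS {m:+.3f}")
```

The MS outputs have the same signs as the SP outputs but never a smaller magnitude, which is why the approximation needs only comparisons and sign logic instead of a LUT.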
The layered belief propagation (LBP) algorithm [14], which is a variation of the standard BP algorithm, treats the PCM as a group of concatenated horizontal layers. Each layer represents a row of sub-matrices in Eq. (1). The LBP algorithm decodes the horizontal layers one after another and updates the APP messages to be passed to the next layer. Algorithm 1 shows pseudocode of the layered MS decoding algorithm.
Each iteration consists of L sub-iterations, and each sub-iteration corresponds to the processing of one layer. In Algorithm 1, the CN processing computes Eq. (8) and updates the check-to-variable messages using the two smallest magnitudes (min0 and min1) and the sign bits of the variable-to-check messages. The VN processing, which is composed of two sub-operations, computes the APPs given in Eq. (7) and updates the variable-to-check messages. It is well known that layered decoding can reduce the number of iterations, since the latest extrinsic messages are passed to and used by subsequent layers within the current iteration.
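The processing of a single check row within one sub-iteration can be sketched as follows. This is an illustrative reading of the steps described above, not a transcription of Algorithm 1; the function name and the example values are hypothetical, and ties or zero-valued messages are handled in the simplest way.

```python
def layer_row_update(app, cn_old):
    """One check row of a layered min-sum sub-iteration.

    app    : APP values of the VNs connected to this check node
    cn_old : check-to-variable messages of this row from the previous iteration
    Returns (app_new, cn_new).
    """
    # Variable-to-check messages: remove this row's old contribution.
    vn = [a - c for a, c in zip(app, cn_old)]
    mags = [abs(x) for x in vn]

    # Smallest magnitude (min0), its position, and second smallest (min1).
    idx0 = min(range(len(vn)), key=lambda k: mags[k])
    min0 = mags[idx0]
    min1 = min(m for k, m in enumerate(mags) if k != idx0)

    neg_count = sum(1 for x in vn if x < 0)
    cn_new = []
    for k, x in enumerate(vn):
        # The VN that supplied min0 receives min1; all others receive min0.
        mag = min1 if k == idx0 else min0
        # Sign is the product of the signs of all *other* messages.
        others_neg = neg_count - (1 if x < 0 else 0)
        sign = -1.0 if others_neg % 2 else 1.0
        cn_new.append(sign * mag)

    # Updated APPs are passed on to the next layer.
    app_new = [v + c for v, c in zip(vn, cn_new)]
    return app_new, cn_new

app_new, cn_new = layer_row_update([1.5, -0.5, 2.0, -3.0], [0.0, 0.0, 0.0, 0.0])
print(cn_new)
print(app_new)
```

Because `app_new` is written back immediately, the next layer already sees the refined values, which is the mechanism behind the faster convergence mentioned above.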
III. LDPC DECODER ARCHITECTURE
This section describes the proposed LDPC decoder architecture supporting the 114 operation modes of QC-LDPC codes for the Mobile WiMAX system, which implements the layered MS decoding algorithm shown in Algorithm 1. The design considerations for an efficient implementation include the following: block-serial scheduling is adopted for multi-mode operation and small area, and the word-length of messages is determined by fixed-point simulation considering the tradeoff between hardware complexity and decoding performance.
Fig. 2 shows the top-level architecture of our LDPC decoder, which is composed of five parts: four banks of decoding function units (DFUs), CN memory, APP memory, H-ROM & SVG for storing and generating shift values, and a permuter block. Since the sub-matrix size $z_f$ is defined by $z_f = 24 + 4f$ ($0 \le f \le 18$), ranging from 24 to 96, a high degree of flexibility is required in the decoder architecture. Our decoder uses a partially parallel architecture that processes one sub-matrix per clock cycle using 96 DFUs. Since the parallelism factor $z_f$ varies from 24 to 96 in increments of 4, the 96 DFUs are grouped into four banks, and each bank consists of 6 sub-groups of 4 DFUs, which are selectively activated according to $z_f$. This results in simple control as well as a reduction in overall power consumption, because the banks and sub-groups that are not in use are deactivated. Each DFU is independent of all others since there is no data dependence between adjacent CNs. The CN memory stores the CN updating messages to be used in the next iteration, and the APP memory holds the APP messages to be used in the processing of the next layer.
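The relation between the expansion factor and the number of active DFUs can be sketched as below. The bank and sub-group sizes follow the text; the policy of filling banks in order is an assumption made only for illustration, since the text does not specify which sub-groups are enabled for a given $z_f$.

```python
NUM_BANKS = 4
SUBGROUPS_PER_BANK = 6
DFUS_PER_SUBGROUP = 4

def expansion_factor(f):
    """Sub-matrix size z_f = 24 + 4f for mode index 0 <= f <= 18."""
    if not 0 <= f <= 18:
        raise ValueError("f must be in the range 0..18")
    return 24 + 4 * f

def active_subgroups(z_f):
    """Number of enabled sub-groups per bank (assumed: banks are filled in order)."""
    needed = z_f // DFUS_PER_SUBGROUP
    counts = []
    for _ in range(NUM_BANKS):
        use = min(needed, SUBGROUPS_PER_BANK)
        counts.append(use)
        needed -= use
    return counts

for f in (0, 7, 18):
    z = expansion_factor(f)
    print(z, active_subgroups(z))
```

Under this assumed policy the smallest code uses a single bank, while the largest code enables all 96 DFUs.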
Overall Architecture
As mentioned in Section I, the QC-LDPC code of Mobile WiMAX has 114 operation modes supporting 19 code lengths and 6 code rates, and the sub-matrices of the PCM for each operating mode have particular circular shift values as defined in Eq. (2). Since the Mobile WiMAX standard defines only 6 base model PCMs for the largest code length (N=2304), the remaining 108 PCMs must be generated using Eq. (2). As is well known, the division, the floor function $\lfloor x \rfloor$ and the modulo function in Eq. (2) require complicated hardware. In this paper, we devised an efficient circuit for generating the shift values of the 108 PCMs as shown in Fig. 3 direct implementation using LUTs in [15] and [17].
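For reference, the shift-value computation can be written in software as a minimal sketch. It assumes that Eq. (2) is the scaling rule of the IEEE 802.16e standard, in which a positive base shift p defined for $z_0 = 96$ is scaled to $\lfloor p \cdot z_f / z_0 \rfloor$, with a modulo variant used for one of the rate-2/3 codes, and non-positive entries are left unchanged. The function name and the `use_modulo` flag are hypothetical, and the sketch does not reproduce the circuit of Fig. 3.

```python
Z0 = 96  # expansion factor of the base model PCMs (N = 2304)

def scaled_shift(p, z_f, use_modulo=False):
    """Shift value for expansion factor z_f derived from base shift p.

    Non-positive entries (zero shift or zero-block marker) are returned
    unchanged. `use_modulo` selects the modulo variant instead of the
    scale-and-floor variant (assumed to apply to one rate-2/3 code).
    """
    if p <= 0:
        return p
    if use_modulo:
        return p % z_f
    return (p * z_f) // Z0

# Base shift 94 scaled to several sub-matrix sizes.
print([scaled_shift(94, z) for z in (24, 48, 96)])
```

A direct hardware mapping of this function needs a multiplier and a divider (or a LUT per mode), which is the cost the shift value generator is intended to avoid.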
Decoding Function Unit
Fig. 4 shows the data path of the DFU, which implements the layered MS decoding algorithm described in Algorithm 1. It consists of SM_TC and TC_SM blocks that convert sign-magnitude to 2's complement and vice versa, a min_detector that finds the two minimum values, an adder, a subtractor, and a FIFO. Each DFU reads the CN messages of the previous iteration from the CN memory and the APP messages of the previous layer from the APP memory, calculates the VN messages, and then finds the minimum (min0) and second-minimum (min1) values among the VN messages of each layer over all VNs. The min0 and min1 are the new CN messages used to update the VNs. The DFU also computes the new APP messages by adding the new CN messages to the current VN messages. These new CN and APP messages are stored in the CN memory and the APP memory, respectively, so that they can be used in the next iteration and in the processing of the next layer.
Check Node Memory
The word-length of the messages processed in the DFUs and stored in the memories determines the hardware cost of the DFUs and the memory requirements. The APP memory stores the messages of one layer, whereas the CN memory must store the CN messages of all layers, so the CN memory accounts for a large share of the hardware. In this paper, we focus on a technique for efficiently reducing the size of the CN memory. Fig. 5(a) shows the basic structure of a conventional CN memory, which stores all CN messages of the L layers. With a fixed-point word-length of w bits, the total size of the CN memory is ($w \cdot z_f \cdot S_l \cdot L$) bits, where $S_l$ denotes the number of non-zero sub-matrices in a layer. For code length 2304 and code rate 1/2 with a word-length of w=8 bits, the size of the CN memory is 64,512 bits. To reduce the CN memory size, we propose a new memory structure, shown in Fig. 5(b). Each layer consists of $S_l$ non-zero sub-matrices of size $z_f \times z_f$, so each layer can be regarded as a matrix with $z_f$ rows and $z_f \cdot S_l$ columns. From the CN processing of Algorithm 1, each layer has $z_f$ min1s and $z_f \cdot (S_l - 1)$ min0s. The key idea of our memory reduction is that there is no need to store all $z_f \cdot (S_l - 1)$ min0s, since the min0s within a row have the same value. Therefore, we store only $z_f$ min0s rather than $z_f \cdot (S_l - 1)$ min0s for each layer. As shown in Fig. 5(b), our CN memory stores $z_f$ Mag_min0s and $z_f$ Mag_min1s together with ($z_f \cdot S_l$) SM data for each layer. Mag_min0 and Mag_min1 represent the magnitudes of min0 and min1, respectively. The 2-bit SM indicates the sign (i.e., positive or negative) and the type of minimum (i.e., min0 or min1). The basic concept of storing the CN messages in compressed form is similar to the methods in [17] [18] [19], but our CN memory structure and implementation differ from them.
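The compressed storage of one check row can be illustrated with a short sketch. It assumes integer (fixed-point) message values and represents each 2-bit SM entry as a (sign bit, type bit) pair; the bit ordering, the function names and the example row are hypothetical and do not describe the actual memory layout of Fig. 5(b).

```python
def compress_row(cn_msgs):
    """Compress the min-sum CN messages of one check row.

    Assumes the row holds at most two distinct magnitudes (min0 and min1),
    as produced by min-sum CN processing.
    Returns (mag_min0, mag_min1, sm) with sm = [(sign_bit, type_bit), ...].
    """
    mags = [abs(m) for m in cn_msgs]
    mag_min0, mag_min1 = min(mags), max(mags)
    sm = [(1 if m < 0 else 0,                 # sign bit: 1 means negative
           1 if abs(m) != mag_min0 else 0)    # type bit: 1 means min1
          for m in cn_msgs]
    return mag_min0, mag_min1, sm

def restore_row(mag_min0, mag_min1, sm):
    """Rebuild the signed CN messages from the compressed representation."""
    out = []
    for sign_bit, type_bit in sm:
        mag = mag_min1 if type_bit else mag_min0
        out.append(-mag if sign_bit else mag)
    return out

row = [4, -12, 4, -4]            # illustrative quantized CN messages
packed = compress_row(row)
print(packed)
print(restore_row(*packed))
```

Only two magnitudes per row are kept regardless of the row weight, so the saving grows with the number of non-zero sub-matrices in a layer.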
Since the CN memory does not store all the min0s, the 2's complement value of each CN message must be restored from the Mag_min0s, Mag_min1s and SM information stored in the CN memory. The SM_TC block shown in Fig. 4 is used in the DFU to convert the sign-magnitude value to a 2's complement value. The hardware overhead of the SM_TC block is trivial compared to the amount of CN memory saved. Table 2 compares the CN memory sizes for code length 2304 and code rate 1/2 of Mobile WiMAX. For a word-length of w=8 bits, the proposed method requires only 34,560 bits, which reduces the CN memory by 46% compared to the conventional method. Fig. 6 compares the CN memory sizes for various code rates and code lengths of the Mobile WiMAX system. Note that a much larger memory reduction is obtained for the higher code rates.
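The memory sizes quoted above can be cross-checked with a short calculation. The sketch assumes $S_l = 7$, $L = 12$, and that Mag_min0 and Mag_min1 are each stored with the full w bits; these values are not stated explicitly in the text and are used here because they reproduce the 64,512-bit and 34,560-bit totals. The function names are hypothetical.

```python
def cn_memory_conventional(w, z_f, s_l, num_layers):
    """Bits needed when every CN message of every layer is stored."""
    return w * z_f * s_l * num_layers

def cn_memory_compressed(w, z_f, s_l, num_layers, sm_bits=2):
    """Bits needed when only Mag_min0, Mag_min1 and SM data are stored.

    Assumption: both magnitudes use w bits and each SM entry uses sm_bits.
    """
    per_layer = 2 * w * z_f + sm_bits * z_f * s_l
    return per_layer * num_layers

# Code length 2304, code rate 1/2 (assumed S_l = 7, L = 12), w = 8 bits.
conv = cn_memory_conventional(8, 96, 7, 12)
comp = cn_memory_compressed(8, 96, 7, 12)
reduction = 1 - comp / conv
print(conv, comp, f"{reduction:.0%}")
```

In this model the compressed size grows with $S_l$ only through the 2-bit SM term, whereas the conventional size grows with $w \cdot S_l$, which is consistent with the larger savings reported for higher code rates.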
HROM
The QC-LDPC code for the Mobile WiMAX system has 6 base model PCMs, which are stored in the HROM and used for generating the shift values of the other 108 PCMs. Most of the sub-matrices in the base model PCMs are zero matrices. For example, the PCM for code length 2304 and code rate 1/2 has 212 zero sub-matrices out of 288 sub-matrices. Based on this observation, we can reduce the HROM size by storing only the non-zero sub-matrices, as depicted in Fig. 7(b), instead of storing all the sub-matrices as shown in Fig. 7(a). In Fig. 7, $N_s$ denotes the total number of sub-matrices in a PCM and $S_l$ denotes the total number of non-zero sub-matrices in a layer. In our method, the shift values and positions of the non-zero sub-matrices are stored with 10~12 bits, which reduces the HROM size by 17% compared to the conventional method of storing all the sub-matrices, including the zero sub-matrices.
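The idea of storing only non-zero sub-matrices can be sketched as a conversion from a dense base matrix to per-layer lists of (position, shift) pairs. The -1 marker for zero sub-matrices, the field widths and the function names are assumptions for illustration; the actual HROM word format of Fig. 7(b) is not reproduced here.

```python
def to_sparse_layers(base):
    """Keep only non-zero sub-matrices as (column, shift) pairs per layer.

    Assumed convention: -1 marks a zero sub-matrix.
    """
    return [[(j, s) for j, s in enumerate(row) if s >= 0] for row in base]

def bits_per_entry(n_cols, max_shift):
    """Bits for one stored entry: column position field plus shift value field."""
    pos_bits = max(1, (n_cols - 1).bit_length())
    shift_bits = max(1, max_shift.bit_length())
    return pos_bits + shift_bits

# Toy base matrix for illustration only (not a WiMAX base matrix).
toy_base = [[-1, 94, -1, 73, -1, 0],
            [27, -1, -1, -1, 22, 0]]
sparse = to_sparse_layers(toy_base)
print(sparse)
print(bits_per_entry(n_cols=24, max_shift=95))
```

With 24 base-matrix columns and shifts up to 95, such an entry would need 5 + 7 = 12 bits under these assumptions.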
Fixed-point Simulation
For hardware implementation, it is important to choose an optimal word-length for the messages, since there is a tradeoff between hardware cost and bit-error-rate (BER) performance. Fig. 8 shows the fixed-point simulation results of our decoder for various message word-lengths. In the fixed-point simulation, code length 2304 and code rate 1/2 were chosen and the maximum number of iterations was fixed at 8. As shown in Fig. 8, the BER performance is very poor when the word-length of the integer part is less than 5 bits, and differs only slightly for integer-part word-lengths greater than 5 bits. Based on this analysis, a word-length of 8 bits (5 bits for the integer part and 3 bits for the fractional part) was chosen for our LDPC decoder. Fig. 9 shows the fixed-point simulation results for various code lengths and code rates with the maximum number of iterations set to 8.
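The chosen 8-bit format can be modeled with a small quantizer. The sketch assumes a 2's complement representation in which the sign bit is counted within the 5 integer bits, with rounding to the nearest step and saturation at the range limits; the text does not specify these details, and the function name is hypothetical.

```python
def quantize(llr, int_bits=5, frac_bits=3):
    """Quantize an LLR to signed fixed point with saturation.

    Assumption: the sign bit is included in int_bits, so the default
    format covers -16.0 to +15.875 in steps of 0.125.
    Returns (integer code, quantized value).
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = int(round(llr * scale))
    code = max(lo, min(hi, code))   # saturate instead of wrapping around
    return code, code / scale

for x in (1.3, 100.0, -100.0):
    print(x, quantize(x))
```

Reducing `int_bits` shrinks the saturation range, which is one way to interpret the sharp BER degradation reported for integer parts shorter than 5 bits.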
IV. IMPLEMENTATION RESULTS
The LDPC decoder was implemented as a synthesizable Verilog HDL model, and its decoding performance was evaluated by simulation and FPGA implementation. As shown in Fig. 10, test vectors in the range of Eb/No = 1.5~3.0 dB with a 0.3 dB step are generated, and the decoding performance is analyzed using Matlab. Fig. 11 shows the BER performance of the decoder for code length 2304 and code rate 1/2, which was obtained by functional simulation of the Verilog HDL model with the maximum number of iterations set to 8. Fig. 12(a) shows the setup using a Xilinx XC5vx50t-1ff1136 device to verify the decoder. An RS232 transceiver and wrapper modules are embedded in the FPGA along with the LDPC decoder to interface with the RS232 port. Test data generated by Matlab are sent to the FPGA with appropriate control signals for decoding, and the decoded data obtained from the FPGA are used to analyze the decoding performance. Fig. 12(b) shows part of the FPGA verification results, indicating that the decoded output from the FPGA is identical to the functional simulation results and thus that the designed LDPC decoder works correctly.
The decoder synthesized using a 0.18-μm cell library has 380,000 gates and a total of 52,992 bits of RAM, including 18,432 bits of APP memory. Since our decoder processes one sub-matrix per clock cycle, it requires a total of ($(S_l + 1) \cdot L$) clock cycles to finish one iteration, where $S_l$ denotes the number of non-zero sub-matrices in a layer and L denotes the number of layers. The estimated throughput is 164~222 Mbps at 56 MHz@1.8 V. Table 3 compares our decoder with state-of-the-art LDPC decoders for Mobile WiMAX. Note that our decoder requires the smallest memory and a comparable gate count.
V. CONCLUSIONS
In this paper, a multi-mode LDPC decoder supporting the 19 code lengths and 6 code rates of the Mobile WiMAX system is described. It adopts a block-serial architecture that processes a sub-matrix of size $z_f \times z_f$ in parallel using 96 DFUs grouped into four banks to support the 114 operation modes. A novel memory reduction technique, which significantly reduces the CN memory and the HROM compared to the conventional approach, is also used. The design techniques of this paper can be applied to other QC-LDPC code decoders, including those for IEEE 802.11n and DVB-S2 systems.
