In this paper, a (1200,720) LDPC decoder based on an irregular parity check matrix is presented. To achieve higher chip density and a shorter critical path delay, the proposed architecture features a new data reordering scheme such that only one specific data bus exists between the message memories and the computational units. Moreover, the LDPC decoder can process two different codewords concurrently to increase throughput and datapath efficiency. After chip implementation, a 3.33Gb/s data rate is achieved with 8 decoding iterations in a 21.23 mm² silicon area in a 0.18µm process. A second chip in a 0.13µm process with a 10.24 mm² core further reaches a 5.92Gb/s data rate under a 1.02V supply.
Introduction
Low-density parity-check (LDPC) codes, first introduced by Gallager [1], have attracted much research interest since their rediscovery [2]. The sparsity of the parity check matrix H makes the decoding algorithm simple and practical at good communication rates [2]. Like turbo codes [3], their performance can be designed to approach the Shannon limit with large block lengths. However, LDPC decoders based on the sum-product algorithm (SPA) are amenable to parallel implementation, which leads to a much higher decoding speed than turbo decoders. Hence the inherent advantages of LDPC codes can enhance system performance for high-speed wireless communications. An LDPC code is often represented by a bipartite graph [4] where N bit nodes and M check nodes are connected by edges according to the $M \times N$ matrix H. Fig. 1 shows an example with 10 bit nodes and 6 check nodes. The column weight of H determines the number of edges (or degree) connecting each bit node to check nodes, while the row weight of H determines the connections for each check node. An LDPC code with a uniform column weight and a uniform row weight is a regular code; otherwise it is termed irregular. It has been shown that irregular codes can outperform those based on regular graphs [5]. However, the irregularity causes signal congestion problems during VLSI implementation and requires more area for routing wires. The fully parallel implementation in [6] demands a large area to accommodate interconnections and has only 50% chip density. The partially parallel architecture suffers from the same complications, and may be even worse due to multiplexers. Furthermore, the critical path delay induced by global routing limits the maximum achievable throughput. In this paper, a more efficient method that schedules the data flow and eliminates multiplexers to reduce signal routing is proposed and applied to a (1200,720) LDPC decoder. The implementation results show that the decoder can achieve above 70% chip density and 3.33Gb/s throughput with 8 decoding iterations. This paper is organized as follows. The decoding algorithm is described in section 2. Section 3 presents the decoder architecture. The chip implementation is shown in section 4. Finally, the conclusion is given in section 5.
Figure 1: The bipartite graph of a (10,4) LDPC code.
Decoding Algorithm
The decoding of LDPC codes is based on the sum-product algorithm (SPA) [2], or belief propagation (BP) algorithm, which iteratively updates the a posteriori probabilities of bit nodes. In the logarithmic domain, the probability values are often represented by the log-likelihood ratio (LLR) $L(x) = \log \frac{P(x=0)}{P(x=1)}$ for less computational complexity. The decoding procedure consists of two steps, the horizontal step (H-step) and the vertical step (V-step). In the horizontal step, the message sent from check node $c_j$ to bit node $b_i$ can be expressed by
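where $B(j) \setminus b_i$ is the set of bit nodes connected to check node $c_j$ except $b_i$. In the vertical step, the message sent from bit node $b_i$ to check node $c_j$ can be expressed by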
where $C(i) \setminus c_j$ is the set of check nodes connected to $b_i$ except $c_j$, and $L(b_i)$ is the channel value. The circuit to implement (1) can use a table look-up approach [6]. An alternative is the sub-optimal expression of (1), which is
This approximation significantly reduces hardware complexity at the cost of a slight performance degradation. The iterative decoding flow chart is illustrated in Fig. 2, where the syndrome check identifies a valid codeword by calculating M parity checks from the estimated data.
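For reference, the textbook forms of the two SPA updates and of the min-sum approximation are written out below. This is a sketch in assumed notation: the message symbols $L(r_{ji})$ (check node to bit node) and $L(q_{ij})$ (bit node to check node) and the set name $C(i)$ are not taken from this paper, and the equation numbers simply follow the order in which the text cites them.

```latex
\begin{align}
L(r_{ji}) &= 2\tanh^{-1}\Bigl(\prod_{b_{i'} \in B(j)\setminus b_i}
             \tanh\bigl(\tfrac{1}{2}L(q_{i'j})\bigr)\Bigr) \tag{1}\\
L(q_{ij}) &= L(b_i) + \sum_{c_{j'} \in C(i)\setminus c_j} L(r_{j'i}) \tag{2}\\
L(r_{ji}) &\approx \Bigl(\prod_{b_{i'} \in B(j)\setminus b_i}
             \operatorname{sgn} L(q_{i'j})\Bigr)
             \min_{b_{i'} \in B(j)\setminus b_i} \bigl|L(q_{i'j})\bigr| \tag{3}
\end{align}
```

In this form, (3) replaces the hyperbolic tangent products of (1) with a sign product and a minimum search, which is why it maps onto a sorter in hardware.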
Figure 2: Flow chart of iterative decoding (horizontal step, vertical step, syndrome check).
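The loop of Fig. 2 can be sketched in software as follows. This is a minimal floating-point illustration, not the fixed-point hardware datapath of this paper: the function name `min_sum_decode`, the dense-matrix message storage, and the exact position of the syndrome check inside the loop are assumptions made for the example. It uses the min-sum form of the H-step and the LLR convention in which a positive value favours bit 0.

```python
import numpy as np

def min_sum_decode(H, llr, max_iter=8):
    """Min-sum decoding sketch.

    H   : (M, N) parity check matrix with 0/1 entries; every row is
          assumed to have weight >= 2.
    llr : length-N channel LLRs, positive values favour bit 0.
    Returns (hard_decision, converged).
    """
    H = np.asarray(H, dtype=bool)
    llr = np.asarray(llr, dtype=float)
    M, N = H.shape
    # Initialization: bit-to-check messages start as the channel values.
    q = np.where(H, llr, 0.0)
    hard = (llr < 0).astype(int)
    for _ in range(max_iter):
        # H-step: each check node sends to each neighbour the sign product
        # and minimum magnitude of the messages from its *other* neighbours.
        r = np.zeros((M, N))
        for j in range(M):
            idx = np.flatnonzero(H[j])
            for i in idx:
                others = q[j, idx[idx != i]]
                r[j, i] = np.prod(np.sign(others)) * np.min(np.abs(others))
        # V-step: channel value plus all incoming check messages, then
        # remove each edge's own contribution to get the extrinsic message.
        total = llr + r.sum(axis=0)
        q = np.where(H, total - r, 0.0)
        hard = (total < 0).astype(int)
        # Syndrome check: stop as soon as all M parity checks are satisfied.
        if not np.any((H.astype(int) @ hard) % 2):
            return hard, True
    return hard, False
```

For example, with the parity check matrix of a (7,4) Hamming code and one unreliable, flipped channel value, the sketch recovers the all-zero codeword within the first iteration.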
Architecture Design
In Fig. 2, the H-step and V-step are linked in a loop to decode valid codewords iteratively. The computational units for the H-step and V-step are the check node unit (CNU) and the bit node unit (BNU), respectively. Since a partially parallel architecture is adopted to reduce the number of computational units, each iteration requires 4 cycles by dividing the $480 \times 1200$ parity check matrix H into four $240 \times 600$ sub-matrices, as shown in Fig. 3. The proposed LDPC decoder architecture illustrated in Fig. 4 contains the input buffer, 240 CNUs, 600 BNUs, and two dedicated message memory units (MMUs). The sets of data processed by the CNUs are $\{h_{00}, h_{01}\}$ and $\{h_{10}, h_{11}\}$, whereas the data fed into the BNUs should be $\{h_{00}, h_{10}\}$ and $\{h_{01}, h_{11}\}$. Note that the MMUs reorder the data sequence such that only one specific data bus exists between the MMUs and the datapaths. Hence many multiplexers can be eliminated in contrast with other approaches. In MMU-1, sub-blocks B, C, D, and E capture the outputs from the CNUs, while sub-blocks A and C deliver the message data to the BNUs after reordering. The extra sub-block E is used as temporary memory to reduce the interconnection between the other sub-blocks. MMU-0 is analogous to MMU-1 but operates in the reverse direction. Fig. 5 shows the detailed timing diagram of the data reordering in MMU-1. The inputs of the BNUs appear sequentially in sub-blocks A and C after the data from the CNUs are reordered. This data reordering scheme also enables the decoder to process two different codewords concurrently without stalls. Therefore, the LDPC decoder is not only area-efficient, but its decoding speed is also comparable to that of the fully parallel architecture in [6]. The other components in Fig. 4 are detailed in the following subsections.
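The partitioning and the reordering task of the MMUs can be illustrated with a short sketch. It only models the order in which the four sub-matrices are consumed (row-wise by the CNUs, column-wise by the BNUs) and derives the permutation a memory unit has to realize between the two. The helper names `block` and `reorder_permutation` are hypothetical, and the internal behaviour of sub-blocks A to E is not modelled.

```python
import numpy as np

M, N = 480, 1200        # (1200,720) code: M = 1200 - 720 parity checks
ROWS, COLS = 240, 600   # sub-matrix size: 240 CNUs by 600 BNUs

def block(H, r, c):
    """Return sub-matrix h_rc of the parity check matrix H."""
    return H[r * ROWS:(r + 1) * ROWS, c * COLS:(c + 1) * COLS]

# Order in which the sub-matrices are consumed by each kind of unit.
CNU_ORDER = [(0, 0), (0, 1), (1, 0), (1, 1)]  # row-wise:    {h00,h01}, {h10,h11}
BNU_ORDER = [(0, 0), (1, 0), (0, 1), (1, 1)]  # column-wise: {h00,h10}, {h01,h11}

def reorder_permutation(src, dst):
    """For each item of dst, give its position in the src sequence."""
    return [src.index(b) for b in dst]

# Data written in CNU order must be read back in BNU order.
print(reorder_permutation(CNU_ORDER, BNU_ORDER))
```

The resulting permutation swaps the two middle sub-matrices, $h_{01}$ and $h_{10}$, which is the reordering that the MMUs perform in place of multiplexers on the datapath.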
Input buffer
The input buffer is also a storage component that receives and keeps channel values for iterative decoding. Channel values should be fed into the CNUs during initialization and into the BNUs while performing the V-step. For this design, the input buffer is connected to the BNUs and provides channel values. The timing diagram of the input buffer is presented in Fig. 7. Channel values of two different codewords are shifted within the buffer ring as shown in Fig. 6. Therefore, only buf-3 is connected to the BNUs, leading to a much simpler signal transfer.
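The idea of a buffer ring with a single tap can be sketched as follows. The ring size of four buffers (buf-0 to buf-3) and the assumption that each buffer holds one segment of a codeword are guesses made for illustration only, since the text names only buf-3; the class `BufferRing` is hypothetical.

```python
from collections import deque

class BufferRing:
    """Ring of buffers in which only the last one (buf-3) is read out."""

    def __init__(self, segments):
        # ring[0] models buf-0, ..., ring[-1] models buf-3 (assumed size 4).
        self.ring = deque(segments)

    def tap(self):
        # The only connection to the BNUs: the content of the last buffer.
        return self.ring[-1]

    def shift(self):
        # Every segment moves one buffer forward; buf-3 wraps to buf-0.
        self.ring.rotate(1)
```

Because the data circulate, every segment of both codewords eventually passes through the tapped buffer, so no multiplexer is needed to select among the buffers.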
Check Node Unit
Based on (3), the CNU can be implemented by a sorter that searches for the minimum magnitude. The sorter can be further modified to enhance the decoding speed by simultaneously updating all edges connected to the same check node. Fig. 8 illustrates the proposed CNU architecture, where all the 6-bit inputs are represented in sign-magnitude notation. The CNU can be divided into two parts: one is a 1-bit sign multiplication and the other is a 5-bit sorter that searches for the minimum value and the second minimum value among the inputs. The new message for each bit node is a combination of the sign bit according to (3) and the new magnitude, which is either the "min" or the "2nd min" of the sorter. Note that if C1, C2, and C3 are set to zero, the channel value input is bypassed directly to the outputs of the BNU. This produces a path for channel values to reach the CNUs to complete initialization, which requires four clock cycles.
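The CNU behaviour described above (a 1-bit sign multiplication plus a sorter that finds the minimum and second minimum magnitude) can be sketched in software as follows. The function name `cnu_update` and the tuple representation of a sign-magnitude message are assumptions for the example; the hardware computes all outputs in parallel rather than in loops.

```python
def cnu_update(messages):
    """Min-sum check node update on sign-magnitude messages.

    messages: list of (sign, mag) pairs for all edges of one check node,
              sign in {0, 1} (1 = negative), mag a 5-bit value in 0..31.
              At least two messages are assumed.
    Returns one (sign, mag) output per edge.
    """
    sign_all = 0
    min1, min2, min_pos = 32, 32, -1   # 32 exceeds any 5-bit magnitude
    for k, (s, m) in enumerate(messages):
        sign_all ^= s                  # sign multiplication as XOR
        if m < min1:
            min1, min2, min_pos = m, min1, k
        elif m < min2:
            min2 = m
    out = []
    for k, (s, _) in enumerate(messages):
        # Exclude the edge's own input: XOR its sign back out, and use the
        # 2nd minimum only on the edge that supplied the minimum.
        mag = min2 if k == min_pos else min1
        out.append((sign_all ^ s, mag))
    return out
```

Tracking only the two smallest magnitudes is sufficient because the minimum over all other edges equals the overall minimum, except on the edge that holds it, where it equals the second minimum.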
Bit Node Unit

Chip Implementation
A test chip has been fabricated in a 1.8V, 0.18µm 1P6M CMOS technology. The chip size is 25 mm², of which the core occupies 21.23 mm². The total gate count is 1.15M including two MMUs, and the chip core density is about 71.2%. After static timing analysis (STA) and post-layout simulation, the decoder achieves 3.33Gb/s throughput with 8 decoding iterations under a 1.62V supply at the worst speed corner. The estimation also includes crosstalk analysis for signal wires that cause coupling noise. Furthermore, the (1200,720) LDPC code has been applied to an MB-OFDM UWB system [7]. The fixed-point performance in the AWGN channel is shown in Fig. 10. The word lengths of the channel value and the message value are 5 bits and 6 bits, respectively.
Proceedings of ESSCIRC, Grenoble, France, 2005
A second test chip is implemented in a 1.2V, 0.13µm 1P8M CMOS technology. The maximum decoding speed improves to 5.92Gb/s with 8 decoding iterations. The chip size becomes 13.5 mm², of which the core constitutes 10.24 mm². Moreover, the chip density grows to about 75.4% because of the two extra metal layers. Fig. 11 shows the 0.13µm chip layout and the distribution of the MMUs. The uniform distribution of the MMUs minimizes routing congestion. Table 1 summarizes the LDPC decoder chips in this paper and compares them with the fully parallel design in [6].
Conclusion
A high-speed and area-efficient LDPC decoder is presented. The message memory architecture permits parallel decoding of two codewords and reduces routing congestion. In addition, the data rescheduling minimizes the signal routing between datapaths and memory units. Consequently, the chip becomes smaller due to the increased chip density. After implementation in 0.18µm technology, the chip achieves a 3.33Gb/s data rate with 8 decoding iterations. Furthermore, the 0.13µm chip reaches a maximum 5.92Gb/s data rate with 13.5 mm² area and 268mW power consumption.
Acknowledgments
