This paper presents the design of low complexity LDPC codes decoders for the upcoming WiFi (IEEE 802.11n 
Introduction
The increasing demand for high data rates and reliability in modern communication systems is pushing next-generation standards toward error correction schemes that allow high-throughput decoding with near-Shannon-limit performance.
At present, Low-Density Parity-Check (LDPC) codes [9] are among the best candidates to meet these requirements. However, early work on hardware implementations [1] pointed out the huge complexity associated with an LDPC decoder, even for short codewords. The peculiarities of the decoding algorithm (iterative processing, transcendental operators, pseudo-random message exchange) strongly affect the traditional VLSI metrics (area, speed, power), making it difficult to meet implementation constraints without degrading communication performance [4].
Resorting to joint code-decoder design techniques [17] has become common practice to ease the implementation of high data rate decoder architectures. Indeed, all upcoming standards that feature LDPC codes, such as WiFi (IEEE 802.11n) [12], WiMax (IEEE 802.16e) [13] and DVB-S2 [7], adopt architecture-aware LDPC codes. Despite the codesign approach, a further reduction in complexity is still needed, especially when coping with the variety of code lengths and rates exhibited by these standards.
After reviewing the state of the art in low-complexity design (alternative schedules, node processing approximations), the paper focuses on how these techniques are used in actual decoder implementations targeting next-generation standards. In particular, an analysis of the LDPC codes for WiFi, WiMax and DVB-S2 is carried out, highlighting the issues and the complexity overhead related to the complete coverage of these standards. Moreover, in line with the decoder-aware techniques used for these standards, the introduction of an optional LDPC code for an OFDM-UWB system featuring a very low complexity decoder implementation is discussed.
Fundamentals of LDPC Decoding
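LDPC codes are linear block codes characterized by a sparse binary matrix $H$, called the parity-check matrix. The set of valid codewords $C$ is defined as:

$$C = \{\, x \in \{0,1\}^N : H x^T = 0 \,\} \qquad (1)$$

where $N$ is the codeword length and the product is computed modulo 2.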
The code can also be described by means of a bipartite graph, known as a Tanner graph (see Figure 1). A Tanner graph is made up of two kinds of nodes, variable nodes (VN) and check nodes (CN), connected to each other through a set of edges. An edge links check node $i$ to variable node $j$ if the element $H_{i,j}$ of the parity-check matrix is non-null.
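As an illustration, the following minimal Python sketch builds the two neighbour sets of a Tanner graph from a parity-check matrix and tests whether a word satisfies every parity check. The matrix `H` is a hypothetical toy example, not one of the standardized codes, and the names `tanner_graph` and `is_codeword` are introduced here for illustration only.

```python
# Hypothetical toy parity-check matrix (3 checks, 6 bits); not from any standard.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def tanner_graph(H):
    """Return (N, M): N[m] lists the VNs attached to CN m,
    M[n] lists the CNs attached to VN n."""
    num_cn, num_vn = len(H), len(H[0])
    N = [[n for n in range(num_vn) if H[m][n]] for m in range(num_cn)]
    M = [[m for m in range(num_cn) if H[m][n]] for n in range(num_vn)]
    return N, M

def is_codeword(H, x):
    """True if every parity check (row of H) sums to zero modulo 2."""
    return all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)

N, M = tanner_graph(H)
num_edges = sum(len(vns) for vns in N)  # one edge per non-null entry of H
print(N, M, num_edges)
print(is_codeword(H, [1, 1, 1, 0, 0, 0]), is_codeword(H, [1, 0, 0, 0, 0, 0]))
```

The number of edges equals the number of non-null entries of the matrix, which is the quantity that drives message storage in the decoders discussed later.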
The optimal LDPC decoding is achieved with a two-phase message passing algorithm (also known as flooding), which can be described as an iterative exchange of probabilistic messages along the edges of the Tanner graph [16]. The algorithm proceeds iteratively until a maximum number of iterations is reached or a stopping rule is met. Inputs of the algorithm are the intrinsic Log-Likelihood Ratios (LLRs), also referred to as a priori information. Intrinsic LLRs measure 
where $\lambda_n$ is the intrinsic LLR on the current bit, and $M_n$ is the set of CNs connected to VN $n$. At the same time, a refined estimate of the transmitted bit is produced, also referred to as the soft output (SO):
Then, in the next semi-iteration, the generic CN $m$ combines together messages µ 
with $N_m$ the set of VNs connected to CN $m$, and
Considerations on how to implement the transcendental operator in (6) with low-complexity hardware can be found in Section 3.2.
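For illustration, a minimal Python sketch of two-phase (flooding) decoding is given below. It assumes the textbook sum-product formulation, in which the CN magnitude is obtained through the function phi(x) = -log(tanh(x/2)) and a positive LLR favours bit 0; the notation and details of the update rules (2)-(6) in this paper may differ. The toy matrix, the clamping constants and all function names are assumptions made for the example.

```python
import math

# Hypothetical toy parity-check matrix; not taken from any standard.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def phi(x):
    """phi(x) = -log(tanh(x/2)); self-inverse for x > 0. Input clamped (assumption)."""
    x = min(max(x, 1e-9), 30.0)
    return -math.log(math.tanh(x / 2.0))

def decode_flooding(H, llr, max_iter=20):
    num_cn, num_vn = len(H), len(H[0])
    N = [[n for n in range(num_vn) if H[m][n]] for m in range(num_cn)]
    M = [[m for m in range(num_cn) if H[m][n]] for n in range(num_vn)]
    # VN-to-CN messages start from the intrinsic LLRs.
    mu = {(m, n): llr[n] for m in range(num_cn) for n in N[m]}
    so = list(llr)
    hard = [1 if v < 0 else 0 for v in so]
    for _ in range(max_iter):
        # CN phase: combine all incoming messages except the one on the same edge.
        eps = {}
        for m in range(num_cn):
            for n in N[m]:
                others = [mu[m, i] for i in N[m] if i != n]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                eps[m, n] = sign * phi(sum(phi(abs(v)) for v in others))
        # VN phase: soft output first, then extrinsic VN-to-CN messages.
        so = [llr[n] + sum(eps[m, n] for m in M[n]) for n in range(num_vn)]
        mu = {(m, n): so[n] - eps[m, n] for (m, n) in eps}
        hard = [1 if v < 0 else 0 for v in so]
        # Stopping rule: all parity checks satisfied.
        if all(sum(hard[n] for n in N[m]) % 2 == 0 for m in range(num_cn)):
            break
    return hard, so

# All-zero codeword with one unreliable, wrongly signed LLR at position 1.
bits, soft = decode_flooding(H, [2.0, -0.5, 1.5, 1.0, 2.0, 1.8])
print(bits)
```

Note that both message sets are kept per edge, so the storage grows with the number of edges in the Tanner graph.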
Low-Complexity Decoder Design
Meeting the throughput requirements of forthcoming standards asks for a massive parallelization of the decoding process, leading to a noticeable chip area. Conversely, architectures with a low degree of parallelism need high clock frequencies, which increases power consumption, a critical figure for mobile devices.
This section reviews state-of-the-art techniques for reducing the complexity of the decoder in terms of VLSI metrics while minimizing the BER implementation loss (IL).
Modified Decoding Schedules
Modified schedules replace the classical two-phase flooding decoding (see Section 2) in order to improve the convergence speed, so that the same communication performance can be achieved in a shorter decoding time. This is especially advantageous for those standard proposals (e.g. WiFi IEEE 802.11n) where the latency budget for channel decoding is tight (6 µs or less). In layered decoding [11, 17] the parity-check matrix is viewed as a sequence of horizontal or vertical layers, hence the names horizontal and vertical shuffle also used for this technique. The layered decoding principle for horizontal layers is expressed by:
Equations (7) and (8) were derived by merging the VN and SO update rules (2)-(3) with the CN update rules (4)-(5). This way the VN phase is spread over the CN updates and the SO estimate $y_n$ is refreshed after every CN update, as expressed by:
where k is the time step within an iteration. The key point of the layered schedule is the intermediate update, within an iteration, of the SO estimates and their propagation to the next layers, boosting the convergence speed and saving up to 50% of the iteration time [11] .
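The horizontal layered schedule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the formulation of equations (7)-(9): each layer is a single row of a hypothetical toy matrix, the CN update uses the textbook function phi(x) = -log(tanh(x/2)), and a positive LLR favours bit 0. All names are introduced for the example.

```python
import math

# Hypothetical toy parity-check matrix; each row is treated as one layer.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def phi(x):
    """phi(x) = -log(tanh(x/2)), with the input clamped (assumption)."""
    x = min(max(x, 1e-9), 30.0)
    return -math.log(math.tanh(x / 2.0))

def decode_layered(H, llr, max_iter=20):
    num_cn, num_vn = len(H), len(H[0])
    N = [[n for n in range(num_vn) if H[m][n]] for m in range(num_cn)]
    y = list(llr)                                   # soft outputs, one per VN
    eps = {(m, n): 0.0 for m in range(num_cn) for n in N[m]}
    hard = [1 if v < 0 else 0 for v in y]
    for _ in range(max_iter):
        for m in range(num_cn):                     # process the layers in sequence
            # Remove the old contribution of this CN from the soft outputs.
            mu = {n: y[n] - eps[m, n] for n in N[m]}
            for n in N[m]:
                others = [mu[i] for i in N[m] if i != n]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                eps[m, n] = sign * phi(sum(phi(abs(v)) for v in others))
                # Refresh the soft output immediately, so that the next
                # layer already sees the updated estimate.
                y[n] = mu[n] + eps[m, n]
        hard = [1 if v < 0 else 0 for v in y]
        if all(sum(hard[n] for n in N[m]) % 2 == 0 for m in range(num_cn)):
            break
    return hard, y

bits, soft = decode_layered(H, [2.0, -0.5, 1.5, 1.0, 2.0, 1.8])
print(bits)
```

Only one value per VN (the soft output) and one message per edge are stored in this sketch, in contrast to a flooding decoder that keeps messages in both directions.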
Note that the layered schedule is compatible with a parallelization of the architecture (i.e. a layer spanning multiple rows), but collision-free decoding can be achieved only for those codes where the rows of the parity-check matrix are grouped into subsets in which the column weight is at most one [11]. However, it is demonstrated in [19] that approximated layered decoding is very robust even when applied to highly non-layered codes.
Another alternative to the flooding schedule is Adaptive Single Phase Decoding (ASPD) [5]. Basically, it adaptively updates a single metric for each VN in the codeword. This metric is continuously accumulated in a running sum with leakage and fed to the CN processor combined with the intrinsic LLRs. The leakage coefficients for the running sum are tuned for each bit in the codeword and dynamically adapted throughout the decoding process.
The main rationale behind ASPD is to reduce the total memory of the decoder. Memory requirements are decoupled from the number of edges in the Tanner graph and depend only on the length of the codeword. For a reduction in memory size ranging from 60 to 80%, the achieved BER performance in fixed-point arithmetic exhibits an IL of 0.2 to 0.3 dB for mid-length codewords (about 4 kbit) [5].
Node Processors Approximations
In a low-complexity scenario the choice of serial architectures for the elementary processing units (CN and VN processors) is almost compulsory. In fact, serial processors offer inherently low complexity and high flexibility, since they easily manage different node degrees with very small resource overhead. This last feature is particularly useful for codes with a wide spread of node degrees (see Section 4.1).
Note that a serial architecture naturally fits most of the approximations proposed for the CN computations, where the gate complexity of the decoder is generally concentrated. Indeed, for low-complexity solutions, the direct implementation of the hyperbolic function of (6) is usually avoided by resorting to successive applications of the binary operator:
In [22] the proposed M-min* operator (11) slightly modifies the original version in order to reduce the complexity and, above all, to increase the numerical stability in fixed-point decoding [18].
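A minimal Python sketch of such a pairwise operator is given below. It assumes the well-known min* (box-plus) identity, in which the exact result is the signed minimum plus two logarithmic correction terms; whether this matches the exact form of (10) or the M-min* variant of (11) is not confirmed by the text. The function names `min_star`, `min_sum` and `cn_output` are hypothetical.

```python
import math
from functools import reduce

def min_star(a, b):
    """Exact pairwise CN operator on two LLRs (assumed min* / box-plus form)."""
    sign = 1.0 if (a >= 0) == (b >= 0) else -1.0
    corr = math.log1p(math.exp(-abs(a + b))) - math.log1p(math.exp(-abs(a - b)))
    return sign * min(abs(a), abs(b)) + corr

def min_sum(a, b):
    """Roughest approximation: keep only the signed minimum."""
    sign = 1.0 if (a >= 0) == (b >= 0) else -1.0
    return sign * min(abs(a), abs(b))

def cn_output(messages, op=min_star):
    """Combine several incoming LLRs by successive application of the operator."""
    return reduce(op, messages)

# Comparison with the direct hyperbolic formula for two inputs.
exact = 2.0 * math.atanh(math.tanh(2.0 / 2.0) * math.tanh(1.0 / 2.0))
print(min_star(2.0, 1.0), exact, min_sum(2.0, 1.0))
print(cn_output([2.0, 1.0, 1.5]), cn_output([2.0, 1.0, 1.5], op=min_sum))
```

The exact operator never exceeds the plain minimum in magnitude, which is why Min-Sum tends to overestimate the reliability of the outgoing messages.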
Min-Sum decoding [8] performs the roughest approximation, retaining only the minimum in (10), although a correction with a constant term has also been proposed. Once an implementation of the binary operator of the check node processor (CNP) has been selected, a possible architecture supporting serial data flow and performing exact marginalization is based on a double recursion scheme, as proposed in [18]. This architecture needs only three operators, but the data flow control for the forward and backward metrics is quite complex and area consuming. A significant saving can be obtained if a dedicated value is computed only for a subset of the incoming messages while a common value is assigned to the remaining ones. In [10] only the set $N_\lambda(i)$ of the $\lambda$ incoming messages with the lowest reliability is considered. In particular, the output magnitude of the messages in $N_\lambda(i)$ is evaluated as usual, while for the others a common value is computed by applying the binary operator to the whole set $N_\lambda(i)$.
A similar strategy is followed in [20], where proper marginalization is performed just for the $P$ messages with the lowest reliability but, in this case, the whole set of input messages is considered. In [4] a low-complexity CNP has been designed for DVB-S2 implementing the algorithm described in [10] with $\lambda = 3$, while in [20] [20] .
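The subset-based check node update attributed to [10] can be sketched in Python as follows. This is an illustrative reading of the description above, not the implementation of [10] or [4]: the pairwise operator on magnitudes is assumed to be the min* form, signs are handled as the parity of the other incoming signs, and the names `min_star_mag` and `lambda_min_cn` are hypothetical.

```python
import math
from functools import reduce

def min_star_mag(a, b):
    """Assumed min* operator on two non-negative magnitudes."""
    value = min(a, b) + math.log1p(math.exp(-(a + b))) - math.log1p(math.exp(-abs(a - b)))
    return max(0.0, value)  # guard against tiny negative rounding errors

def lambda_min_cn(mu, lam=3):
    """CN-to-VN messages computed from the lam least reliable inputs only."""
    if lam < 2 or len(mu) < 2:
        raise ValueError("need at least two inputs and lam >= 2")
    mags = [abs(v) for v in mu]
    total_neg = sum(v < 0 for v in mu)
    # Indices of the lam incoming messages with the lowest reliability.
    subset = sorted(range(len(mu)), key=lambda i: mags[i])[:lam]
    # Common magnitude for every output outside the subset.
    common = reduce(min_star_mag, (mags[i] for i in subset))
    out = []
    for n, v in enumerate(mu):
        if n in subset:
            # Dedicated value: marginalize over the rest of the subset.
            mag = reduce(min_star_mag, (mags[i] for i in subset if i != n))
        else:
            mag = common
        neg_others = total_neg - (1 if v < 0 else 0)
        out.append(-mag if neg_others % 2 else mag)
    return out

print(lambda_min_cn([3.0, 0.5, 1.0, 2.0], lam=2))
print(lambda_min_cn([2.0, -0.5, 1.0], lam=3))
```

When `lam` is at least the check node degree, the sketch reduces to the exact update; smaller values trade accuracy for fewer dedicated computations per check node.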
Decoder-Aware LDPC Codes
All currently standardized LDPC codes share a common set of features that made implementation feasible in the first place: accumulator-based matrices to allow for linear encoding complexity, structured matrices to be mapped onto partly parallel architectures, and permuted identity submatrices to ease network implementation. These aspects can be summarized under the term decoder-aware code design. In this section we present three standards employing decoder-aware LDPC codes and show corresponding synthesis results in a 65 nm technology.
Standardized LDPC Codes
The DVB-S2 satellite video broadcasting standard [7] was designed for exceptional error performance at very low SNR (up to FER ≤ $10^{-7}$ at $-2.35$ dB $E_S/N_0$). Thus the specified LDPC codes use a large block length of 64800 bits with 11 different code rates ranging from 1/4 to 9/10. This results in large storage requirements for up to 285000 messages and, at the same time, demands high code rate flexibility to support all specified node degrees, as shown in the first row of Table 1.
The current WiMax 802.16e [13] standard features LDPC codes as an optional channel coding scheme. It consists of six different code classes with different VN and CN degree distributions, spanning four code rates from 1/2 to 5/6 (see second row of Table 1). All six code classes share the same general parity-check matrix structure and support 19 codeword sizes, ranging from 576 to 2304 bits with a granularity of only 96 bits. This codeword size flexibility is the most challenging aspect of this standardized LDPC code family. The interest of the standardization committee in the throughput and communications performance improvements achievable through layered decoding (see Section 3.1) is evident, since an optimal sequence of layers [19] is specified for two of the code classes [2].
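The codeword size figures quoted for WiMax can be checked with a few lines of Python. The sketch only uses the numbers given above (576 to 2304 bits in steps of 96 bits, rates from 1/2 to 5/6); the variable names are introduced for illustration.

```python
from fractions import Fraction

# Codeword sizes: 576 to 2304 bits with a granularity of 96 bits.
sizes = list(range(576, 2304 + 1, 96))
print(len(sizes))  # number of supported codeword sizes

# Information bits for the two extreme combinations quoted in the text.
smallest_info = sizes[0] * Fraction(1, 2)   # shortest codeword, lowest rate
largest_info = sizes[-1] * Fraction(5, 6)   # longest codeword, highest rate
print(smallest_info, largest_info)
```

The enumeration yields the 19 sizes mentioned in the text, each of which the permutation network of a compliant decoder has to support.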
The upcoming WiFi 802.11n [12] standard will also feature LDPC codes as an optional channel coding scheme. It specifies 12 different codes, with four code rates from 1/2 to 5/6 for each of the three codeword sizes of 648, 1296, and 1944 bits. The most complicated issue with this code family is the CN and VN flexibility needed to fully support the standard (see third row of Table 1).
Application Specific LDPC Codes
For future applications, LDPC decoders with very high throughput and excellent communications performance are required. At the same time, chip area should be small to provide cost-efficient solutions. To achieve this, all the well-known techniques used by the standardized codes presented in Section 4.1 have to be exploited. This puts constraints on the code design, but does not determine the actual code itself and still leaves a large degree of freedom.
Application-specific code design became mandatory in the development of an optional LDPC channel decoder to be used in a sophisticated OFDM-UWB system [21]. Such systems have to provide very high net throughput for upcoming services; therefore LDPC channel coding is suggested as an addition to the current UWB standard. The code rate of 3/4 and the codeword size of 9600 bits were given by the system MAC layer. Silicon area has to be as small as possible to enable low-cost consumer products. In terms of communications performance, the convolutional decoder used in the current UWB standard had to be outperformed at a packet error rate of $10^{-3}$ for the IEEE 802.15.3a CM1 and CM2 channel models.
For this purpose, a so-called Ultra-Sparse LDPC code [3] was designed using the 2V-PEG approach presented in [15]. The last row of Table 1 shows the exceptionally small maximum VN and CN degrees for a rate-3/4 code. With only 26400 edges to process, this LDPC code is especially suited for high-throughput decoding using only a small number of parallel functional units. Typically, a codeword of this size corresponds to more than 35000 edges, requiring 30% more area to achieve the same throughput. The code also allows for layered decoding, which is very difficult to achieve for rate-3/4 codes.
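The edge count can be cross-checked with a short Python calculation. The sketch assumes the degree distribution listed with Table 3, read as f[3,2] = {3/4, 1/4} (three quarters of the VNs have degree 3, one quarter degree 2) and g[11] (all CNs have degree 11); this reading is an assumption, although it reproduces the 26400 edges stated in the text.

```python
from fractions import Fraction

n_bits = 9600                      # codeword size given by the MAC layer
rate = Fraction(3, 4)              # code rate
n_checks = int(n_bits * (1 - rate))

# Assumed reading of the degree distribution: VN degree -> fraction of VNs.
vn_dist = {3: Fraction(3, 4), 2: Fraction(1, 4)}
cn_degree = 11

edges_from_vn = sum(int(n_bits * frac) * deg for deg, frac in vn_dist.items())
edges_from_cn = n_checks * cn_degree
print(n_checks, edges_from_vn, edges_from_cn)

# Relative edge overhead of a typical code of this size (35000 edges, from the text).
print(35000 / edges_from_vn)
```

Both ways of counting give the same total, and the ratio to a typical 35000-edge code is roughly 1.3, in line with the area overhead quoted above.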
LDPC Decoder Architectures
Two different LDPC decoder architectures are considered: a two-phase decoder implementing the flooding schedule, and one for layered decoding. Both architectures are partly parallel, thus only a subset of the nodes in the Tanner graph is instantiated as variable node and check node processors. The node processors work in a serial manner, which gives the needed flexibility regarding the variable node and check node degrees. However, this serial architecture prevents the standardized codes from being decoded with layered scheduling, due to the latency of $d_c$ clock cycles introduced by the CNP. Because of the maximum variable node degree of these codes, it cannot be guaranteed that an updated message is computed before it is needed by another check node. For the proposed Ultra-Sparse LDPC code, with its low $d_v^{\max}$ of three, this constraint for layered decoding is easily satisfied. In both architectures the permutation networks are realized by logarithmic barrel shifters in order to process all LDPC codes specified by permuted identity matrices.
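The role of the logarithmic barrel shifter can be illustrated with a small Python model. A permuted identity submatrix is assumed here to be a cyclically shifted identity, so that routing the messages of one submatrix amounts to a cyclic rotation; the rotation direction, the stage ordering and the name `barrel_shift` are assumptions of this sketch and are not taken from the architectures in [3] or [4].

```python
def barrel_shift(values, shift):
    """Cyclic left rotation built from conditional stages of size 1, 2, 4, ...

    Each stage rotates by a power of two if the corresponding bit of the
    shift amount is set, so about log2(z) stages serve any shift in 0..z-1.
    """
    z = len(values)
    shift %= z
    out = list(values)
    stage = 1
    while stage < z:
        if shift & stage:
            out = out[stage:] + out[:stage]
        stage <<= 1
    return out

# Messages of one submatrix of size z = 8, rotated by 3 positions.
messages = list(range(8))
print(barrel_shift(messages, 3))

# The staged result matches a direct rotation, also for sizes that are
# not a power of two.
z, s = 27, 26
direct = list(range(z))[s:] + list(range(z))[:s]
print(barrel_shift(list(range(z)), s) == direct)
```

Because the number of stages grows only logarithmically with the submatrix size, a single network of this kind can serve every shift value required by the code.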
There are some fundamental differences between the two-phase decoder (Figure 2) and the layered decoder architecture (Figure 3). The two-phase decoder contains two sum RAMs that are used to accumulate all incoming messages of a variable node. During one iteration, one sum RAM is used to compute $\mu^{(q)}_{m,n}$ by subtracting the corresponding message from the message RAM and adding the channel value from the channel RAM. The second sum RAM is needed to build the new sums for the next iteration, hence the two RAMs are swapped after each iteration. In contrast, the layered decoder stores the a posteriori information in the channel RAM. This RAM contains the sum of the channel value and all incoming messages of a variable node ($y^{(q)}_n$), thus only the corresponding message has to be subtracted in the check node block (CNB) to obtain $\mu^{(q)}_{m,n}$. When the CNP has computed new messages, these are stored in the message RAM and added to the $\mu^{(q)}_{m,n}$ bypassed by the FIFO. Using a posteriori values also makes it possible to use only one permutation network in the layered architecture. More details regarding both architectures can be found in [4] and [3].
Implementation Results
The standardized LDPC codes have been mapped onto the two-phase decoder architecture template (Figure 2) and synthesized using the current 65 nm technology from STMicroelectronics. The target frequency was set to 400 MHz, which allows relatively small memories with cycle times of up to 2.5 ns to be used for reduced area utilization. All decoder implementations utilize 3-Min check nodes to support adequate communications performance even for the low code rates 1/4 to 1/2. Table 2 shows the final results for each code, structured by the functional decoder elements shown in Figure 2. It also shows the net and air throughput range, starting from the smallest codeword size with the lowest code rate up to the largest high-rate combination. Therefore the maximum number of iterations is set to allow sufficient

The first column shows the DVB-S2 LDPC decoder implementation with an overall area of 3.8 mm², where 85% of the area is dedicated to memory. Instead of utilizing the maximum parallelization degree supported by the code structure, as already presented in [14] and [4], the method proposed in [6] has been used to scale down the throughput for set-top box applications. Net throughput ranges from 60 Mbps for rate 1/4 to more than 700 Mbps for rate 9/10, easily reaching 90 Mbps for rate 3/5 as required by broadcast service specifications.

The implementation results for the WiMax LDPC decoder are shown in the second column. The network accounts for 25% of the logic area, compared to only 10% for the other standard implementations, reflecting the enormous codeword size flexibility to be supported. A more detailed explanation of the network structure is given in [2]. The achieved air throughput is above 70 Mbps as required by the current WiMax specification. The last column presents the results is guaranteed for all codes specified.

Table 3 shows synthesis results for the application-specific Ultra-Sparse LDPC code, utilizing the previously introduced layered decoder architecture (Figure 3). By using the much simpler Min-Sum check nodes and fast memories, a clock frequency of 500 MHz was achieved for further throughput enhancement. Although the overall area after synthesis is only around 0.5 mm², the net throughput easily exceeds 1 Gbps.

Conclusions
Proposals for next-generation standards have selected LDPC codes for forward error correction, either as mandatory (DVB-S2) or as optional for high-throughput modes (IEEE 802.11n and IEEE 802.16e). Despite the use of joint code/decoder design techniques, implementing a fully compliant LDPC decoder is still a challenging task. The lack of homogeneity in the standardized matrices usually leads to an over-dimensioned and/or partially compliant decoder.
This paper compared the implementation of decoders for different standardized codes using a generic reference architecture. Even though state-of-the-art low-complexity techniques have been used along with a current 65 nm CMOS technology, area figures may still be too high for integrating the decoder in a full modem, especially for low-cost consumer products. It has also been demonstrated that a further reduction in chip area can be achieved if the code is designed with awareness of the decoder implementation bottlenecks. Indeed, future standard proposals such as UWB have to take into account other issues, such as the impact of the node distribution or the decoding schedule, in addition to the well-known block-partitioned matrices.

Table 3. Synthesis Results for a proposed UWB LDPC Code (LDPC code: f[3,2] = {3/4, 1/4}, g[11]).
