Abstract
Low-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different hardware architectures. In this paper we propose a unified message passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo codes decoding. We view the LDPC code as a concatenation of n super-codes where each super-code has a simpler trellis structure so that the MAP algorithm can be easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low hardware overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder architecture to support LDPC/Turbo codes decoding. Multiple such SISO modules can be embedded into a parallel decoder for higher decoding throughput. As a case study, a flexible LDPC/Turbo decoder has been synthesized on a TSMC 90 nm CMOS technology with a core area of 3.2 mm². The decoder can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE
Introduction
Practical wireless communication channels are inherently "noisy" due to the impairments caused by channel distortions and multipath effects. Error correcting codes are widely used to increase the bandwidth and energy efficiency of wireless communication systems. As a core technology in wireless communications, forward error correction (FEC) coding has migrated from basic convolutional/block codes to more powerful Turbo codes and LDPC codes. Turbo codes, introduced by Berrou et al. in 1993 [4], have been employed in 3G and beyond 3G wireless systems, such as UMTS/WCDMA and 3GPP Long-Term Evolution (LTE) systems. As a candidate for 4G coding schemes, LDPC codes, which were introduced by Gallager in 1963 [13], have recently received significant attention in coding theory and have been adopted by some advanced wireless systems such as the IEEE 802.16e WiMAX system and the IEEE 802.11n WLAN system. In future 4G networks, internetworking and roaming between different networks would require a multi-standard FEC decoder. Since Turbo codes and LDPC codes are widely used in many different 3G/4G systems, it is important to design a configurable decoder to support multiple FEC coding schemes.
In the literature, many efficient LDPC decoder VLSI architectures have been studied [6, 9, 12, 14, 18, 24, 27, 29, 35, 37, 39, 45, 47]. Turbo decoder VLSI architectures have also been extensively investigated by many researchers [5, 8, 20, 21, 25, 30, 33, 41, 44]. However, designing a flexible decoder to support both LDPC and Turbo codes still remains very challenging. In this paper, we aim to provide an alternative to dedicated silicon that reduces the cost of supporting both LDPC and Turbo codes with a small additional overhead. We propose a flexible decoder architecture to meet the needs of a multi-standard FEC decoder.
From the theoretical point of view, there are some similarities between LDPC and Turbo codes. They can both be represented as codes on graphs which define the constraints satisfied by codewords. Both families of codes are decoded in an iterative manner by employing the sum-product algorithm or belief propagation algorithm. For example, MacKay has related these two codes by treating a Turbo code as a low-density parity-check code [23]. On the other hand, a few other researchers have tried to treat an LDPC code as a Turbo code and apply a turbo-like message passing algorithm to LDPC codes. For example, Mansour and Shanbhag [24] introduce an efficient turbo message passing algorithm for architecture-aware LDPC codes. Hocevar [18] proposes a layered decoding algorithm which treats the parity check matrix as horizontal layers and passes the soft information between layers to improve the performance. Zhu and Chakrabarti [50] look at the super-code based LDPC construction and decoding. Zhang and Fossorier [46] suggest a shuffled belief propagation algorithm to achieve a faster decoding speed. Lu and Moura [22] propose to partition the Tanner graph into several trees and apply the turbo-like decoding algorithm in each tree for a faster convergence rate. Dai et al. [12] introduce a turbo-sum-product hybrid decoding algorithm for quasi-cyclic (QC) LDPC codes by splitting the parity check matrix into two sub-matrices between which the information is exchanged.
In our earlier work [38], we proposed a super-code based decoding algorithm for LDPC codes. In this paper, we extend this algorithm and present a more generic message passing algorithm for LDPC and Turbo decoding, and then exploit the architectural commonalities between LDPC and Turbo decoders. We create a connection between LDPC and Turbo codes by applying a super-code based decoding algorithm, where a code is divided into multiple super-codes and then the decoding operation is performed by iteratively exchanging the soft information between super-codes. In the LDPC decoding, we treat an LDPC code as a concatenation of n super-codes, where each super-code has a simpler trellis structure so that the maximum a posteriori (MAP) algorithm can be efficiently performed. In the Turbo decoding, we modify the traditional message passing flow so that the proposed super-code based decoding scheme works for Turbo codes as well.
Contributions of this paper are as follows. First, we introduce a flexible soft-input soft-output (Flex-SISO) module for LDPC and Turbo codes decoding. Second, we introduce an area-efficient flexible functional unit (FFU) for implementing the MAP algorithm in hardware. Third, we propose a flexible SISO decoder hardware architecture based on the FFU. Finally, we show how to enable parallel decoding by using multiple such Flex-SISO decoders.
The remainder of the paper is organized as follows. Section 2 reviews the super-code based decoding algorithm for LDPC codes. Section 3 presents a Flex-SISO module for LDPC/Turbo decoding. Section 4 introduces a flexible functional unit (FFU) for LDPC and Turbo decoding. Based on the FFU, Section 5 describes a dual-mode Flex-SISO decoder architecture. Section 6 presents a parallel decoder architecture using multiple Flex-SISO cores. Section 7 compares our flexible decoder with existing decoders in the literature. Finally, Section 8 concludes the paper.
Review of Super-code Based Decoding Algorithm for LDPC Codes
By definition, a Turbo code is a parallel concatenation of two super-codes, where each super-code is a constituent convolutional code. Naturally, the Turbo decoding procedure can be partitioned into two phases where each phase corresponds to the processing of one super-code. Similarly, LDPC codes can also be partitioned into super-codes for efficient processing, as mentioned in Section 1. Before proceeding with a discussion of the proposed flexible decoder architecture, we review the super-code based LDPC decoding scheme in this section.
Trellis Structure for LDPC Codes
A binary LDPC code is a linear block code specified by a very sparse binary M × N parity check matrix H:

$$\mathbf{H} \cdot \mathbf{x}^{T} = \mathbf{0}, \tag{1}$$

where x is a codeword (x ∈ C) and H can be viewed as a bipartite graph where each column and row in H represent a variable node and a check node, respectively. Each element of the parity check matrix is either a zero or a one, where nonzero elements are typically placed at random positions to achieve good performance. The number of nonzero elements in each row or each column of the parity check matrix is called the check node degree or the variable node degree, respectively. A regular LDPC code has the same check node and variable node degrees, whereas an irregular LDPC code has different check node and variable node degrees. The full trellis structure of an LDPC code is enormously large, and it is impractical to apply the MAP algorithm on the full trellis. Alternatively, an (N, N − M) LDPC code can be viewed as M parallel concatenated single parity check codes. Although the performance of a single parity check code is poor, when many of them are sparsely connected they become a very strong code. Figure 1 shows a trellis representation for LDPC codes where a single parity check code is considered as a low-weight two-state trellis, starting at state 0 and ending at state 0.
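The view of an LDPC code as M single parity check codes can be illustrated with a small Python sketch. The 3 × 6 matrix `H` and the helper names below are hypothetical and chosen only for illustration; they are not taken from any standard or from the decoder described in this paper.

```python
# Toy 3 x 6 parity check matrix (hypothetical, for illustration only).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def syndrome(H, x):
    """Compute H * x^T over GF(2); each entry is one single parity check."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]

def is_codeword(H, x):
    """x is a codeword when every single parity check is satisfied."""
    return not any(syndrome(H, x))

def check_node_degrees(H):
    """Number of ones in each row."""
    return [sum(row) for row in H]

def variable_node_degrees(H):
    """Number of ones in each column."""
    return [sum(col) for col in zip(*H)]

x = [1, 0, 1, 1, 1, 0]          # satisfies all three checks
corrupted = [0] + x[1:]         # flip the first bit
print(is_codeword(H, x), syndrome(H, corrupted))
print(check_node_degrees(H), variable_node_degrees(H))
```

In this toy example the column weights differ, so the code would be called irregular in the terminology above.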
Layered Message Passing Algorithm for LDPC Codes
The main idea behind the layered LDPC decoding is essentially the Turbo message passing algorithm [24]. It has been shown that the layered message passing algorithm can achieve a faster convergence rate than the standard two-phase message-passing algorithm for structured LDPC codes [18, 24]. To be more general, we can divide the factor graph of an LDPC code into several sub-graphs [38] as illustrated in Fig. 2. Each sub-graph corresponds to a super-code. If we restrict each sub-graph to be loop-free, then each super-code has a simpler trellis structure so that the MAP algorithm can be efficiently performed.

Figure 2 Dividing a factor graph into sub-graphs (original factor graph with check nodes c1-c4 and variable nodes v1-v6, sub factor graph 1, and sub factor graph 2).

As a special example, the block-structured Quasi-Cyclic (QC) LDPC codes used in many practical communication systems such as 802.16e and 802.11n can be easily decomposed into several super-codes. As shown in Fig. 3, a block-structured parity check matrix can be viewed as a 2-D array of square sub-matrices. Each sub-matrix is either a zero matrix or a z-by-z cyclically shifted identity matrix $I_z(x)$ with random shift value x. The parity check matrix can be viewed as a concatenation of n super-codes where each block row or layer defines a super-code. In the layered message passing algorithm, soft information generated by one super-code can be used immediately by the following super-codes, which leads to a faster convergence rate [24].
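The block-structured decomposition can be sketched in a few lines of Python. The base matrix of shift values, the sub-matrix size, the use of -1 to mark a zero block, and the shift direction are all assumptions made for this illustration; real standards define their own base matrices and conventions.

```python
def shifted_identity(z, x):
    """z-by-z identity matrix with the one in row r moved to column (r + x) mod z."""
    return [[1 if c == (r + x) % z else 0 for c in range(z)] for r in range(z)]

def expand(base, z):
    """Expand a base matrix of shift values (-1 marks a zero block) into H."""
    H = []
    for block_row in base:
        blocks = [shifted_identity(z, s) if s >= 0 else [[0] * z for _ in range(z)]
                  for s in block_row]
        for r in range(z):
            # Concatenate row r of every block in this block row.
            H.append([bit for blk in blocks for bit in blk[r]])
    return H

def layers(H, z):
    """Split H into layers (super-codes) of z consecutive rows."""
    return [H[i:i + z] for i in range(0, len(H), z)]

# Hypothetical 2 x 4 base matrix with sub-matrix size z = 4.
base = [[0, 1, -1, 2],
        [2, -1, 0, 1]]
z = 4
H = expand(base, z)

for index, layer in enumerate(layers(H, z)):
    # Within a layer no column holds more than a single one, so the z
    # single parity checks of a super-code do not share any variable node.
    max_column_weight = max(sum(col) for col in zip(*layer))
    print("layer", index, "max column weight:", max_column_weight)
```

Because every block is either zero or a permutation matrix, the z rows inside one layer touch disjoint sets of variable nodes, which is what allows them to be processed independently.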
Flexible SISO Module
In this section, we propose a flexible soft-input soft-output (SISO) module, named the Flex-SISO module, to decode LDPC and Turbo codes. The SISO module is based on the MAP algorithm [3]. To reduce complexity, the MAP algorithm is usually calculated in the log domain [31]. In this paper, we assume the MAP algorithm is always calculated in the log domain.
The decoding algorithm underlying the Flex-SISO module works for codes which have trellis representations. For LDPC codes, a Flex-SISO module was used

A generic description of the message passing algorithm is as follows. Multiple Flex-SISO modules are connected in series to form an iterative decoder. First, the Flex-SISO module receives the soft values $\lambda_i(u)$ from upstream Flex-SISO modules and the channel values (for parity bits) $\lambda_c(p)$ if available. The $\lambda_i(u)$ can be thought of as the sum of the channel value $\lambda_c(u)$ (for the information bit) and all the extrinsic values $\lambda_e(u)$ previously generated by all the super-codes:

$$\lambda_i(u) = \lambda_c(u) + \sum \lambda_e(u). \tag{2}$$

Note that prior to the iterative decoding, $\lambda_i(u)$ should be initialized with $\lambda_c(u)$. Next, the old extrinsic value $\lambda_e(u; \text{old})$ generated by this Flex-SISO module in the previous iteration is subtracted from $\lambda_i(u)$ as follows:

$$\lambda_t(u) = \lambda_i(u) - \lambda_e(u; \text{old}). \tag{3}$$

Then, the new extrinsic value $\lambda_e(u; \text{new})$ can be computed using the MAP algorithm based on $\lambda_t(u)$, and $\lambda_c(p)$ if available. Finally, the APP value is updated as

$$\lambda_o(u) = \lambda_t(u) + \lambda_e(u; \text{new}). \tag{4}$$
Then this updated APP value is passed to the downstream Flex-SISO modules. This computation repeats in each sub-iteration.
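The generic message passing step can be sketched as follows. This is a minimal Python illustration, not the hardware data path: `flex_siso_step`, `toy_map_a` and `toy_map_b` are hypothetical names, and the two toy functions are arbitrary stand-ins for the MAP processing of two super-codes. The sketch only demonstrates the bookkeeping: subtract the old extrinsic value, run the MAP step, add the new extrinsic value.

```python
def flex_siso_step(app_in, ext_old, map_extrinsic):
    """One sub-iteration of a Flex-SISO module on a list of LLRs."""
    # Remove this module's own previous contribution from the soft input.
    tmp = [a - e for a, e in zip(app_in, ext_old)]
    # MAP processing of the super-code (supplied by the caller).
    ext_new = map_extrinsic(tmp)
    # Updated APP values passed to the downstream module.
    app_out = [t + e for t, e in zip(tmp, ext_new)]
    return app_out, ext_new

# Arbitrary placeholders for the MAP processors of two super-codes.
def toy_map_a(tmp):
    return [0.25 * v for v in reversed(tmp)]

def toy_map_b(tmp):
    return [-0.5 * v for v in tmp]

channel = [1.0, -2.0, 0.5]
app = list(channel)                        # APP memory starts at the channel LLRs
maps = [toy_map_a, toy_map_b]
ext = [[0.0] * len(channel) for _ in maps] # extrinsic memories start at zero

for _ in range(3):                         # three full iterations
    for s, map_fn in enumerate(maps):      # one sub-iteration per super-code
        app, ext[s] = flex_siso_step(app, ext[s], map_fn)

# The APP value always equals the channel value plus the latest extrinsic
# value of every super-code.
residual = [app[i] - (channel[i] + ext[0][i] + ext[1][i]) for i in range(len(channel))]
print(residual)
```

The residual stays at zero up to rounding, which reflects the statement that the soft input is the channel value plus all extrinsic values generated so far.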
Flex-SISO Module to Decode LDPC Codes
In this section, we show how to use the Flex-SISO module to decode LDPC codes. Because QC-LDPC codes are widely used in many practical systems, we will primarily focus on QC-LDPC codes. First, we decompose a QC-LDPC code into multiple super-codes, where each layer of the parity check matrix defines a super-code. After the layered decomposition, each super-code comprises z independent two-state single parity check codes. Figure 5 shows the super-code based, or layered, LDPC decoder architecture using the Flex-SISO modules. The decoder parallelism at each Flex-SISO module is at the level of the sub-matrix size z, because these z single parity check codes have no data dependency and can thus be processed simultaneously. This architecture differs from the regular two-phase LDPC decoder in that a code is partitioned into multiple sections, and each section is processed by the same processor. The convergence rate can be up to twice as fast as that of a regular decoder [18].
... Since the data flow is the same between different sub-iterations, one physical Flex-SISO module is instantiated, and it is re-used at each sub-iteration, which leads to a partial-parallel decoder architecture. Figure 6 shows an iterative LDPC decoder hardware architecture based on the Flex-SISO module. The structure comprises an APP memory to store the soft APP values, an extrinsic memory to store the extrinsic values, and a MAP processor to implement the MAP algorithm for z single parity check codes. Prior to the iterative decoding process, the APP memory is initialized with the channel values $\lambda_c(u)$, and the extrinsic memory is initialized with 0.
The decoding flow is summarized as follows. It should be noted that the parity bits are treated as information bits for the decoding of LDPC codes. We use the symbol $u_k$ to represent the k-th data bit in the codeword. For check node m, we use the symbol $u_{m,k}$ to denote the k-th codeword bit (or variable node) that is connected to this check node m. To remove correlations between iterations, the old extrinsic message is subtracted from the soft input message to create a temporary message $\lambda_t$ as follows:

$$\lambda_t(u_{m,k}) = \lambda_i(u_{m,k}) - \lambda_e(u_{m,k}; \text{old}), \tag{5}$$

where $\lambda_i(u_{m,k})$ is the soft input log likelihood ratio (LLR) and $\lambda_e(u_{m,k}; \text{old})$ is the old extrinsic value generated by this MAP processor in the previous iteration. Then the new extrinsic value can be computed as:

$$\lambda_e(u_{m,k}; \text{new}) = \underset{j:\, j \neq k}{\boxplus}\; \lambda_t(u_{m,j}), \tag{6}$$

where the operation ⊞ is associative and commutative, and is defined as [15]

$$a \boxplus b \triangleq \log\frac{1 + e^{a} e^{b}}{e^{a} + e^{b}}. \tag{7}$$

Finally, the new APP value is updated as:

$$\lambda_o(u_{m,k}) = \lambda_t(u_{m,k}) + \lambda_e(u_{m,k}; \text{new}). \tag{8}$$

For each sub-iteration l, Eqs. (5)-(8) can be executed in parallel for check nodes m = lz to lz + z − 1 because there is no data dependency between them.
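The layered decoding flow can be sketched in software as below. This is a floating-point Python illustration under several assumptions: the 4 × 6 matrix `H` with z = 2 is a hypothetical toy code, LLRs are taken as log(P(0)/P(1)), and `boxplus` uses a sign-min form that is algebraically equal to the usual log-ratio definition of the parity LLR. The z checks of a layer are processed sequentially here, which gives the same result as parallel processing because they share no variable node.

```python
import math

def boxplus(a, b):
    """LLR of the XOR of two independent bits with LLRs a and b."""
    return (math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))

def layered_decode(H, channel_llr, z, iterations=5):
    """Layered decoding; every check node is assumed to have degree >= 2."""
    app = list(channel_llr)                 # APP memory, initialized with channel LLRs
    ext = [dict() for _ in H]               # extrinsic memory, implicitly zero
    for _ in range(iterations):
        for start in range(0, len(H), z):   # one sub-iteration per layer
            for m in range(start, start + z):
                cols = [k for k, h in enumerate(H[m]) if h]
                # Subtract the old extrinsic values of this check node.
                tmp = {k: app[k] - ext[m].get(k, 0.0) for k in cols}
                for k in cols:
                    others = [tmp[j] for j in cols if j != k]
                    e = others[0]
                    for v in others[1:]:
                        e = boxplus(e, v)   # combine all other bits of the check
                    ext[m][k] = e
                    app[k] = tmp[k] + e     # updated APP value
    bits = [0 if llr >= 0 else 1 for llr in app]
    return bits, app

# Toy code: two layers of z = 2 single parity checks each (hypothetical).
H = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
]
# All-zero codeword sent; the first LLR is weakly wrong.
channel = [-0.5, 2.0, 2.0, 2.0, 2.0, 2.0]
bits, app = layered_decode(H, channel, z=2)
print(bits)
```

In this example the second layer already sees the improved APP value of the first bit produced by the first layer, which is the mechanism behind the faster convergence of layered decoding.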
Flex-SISO Module to Decode Turbo Codes
In this section, we show how to use the Flex-SISO module to decode Turbo codes. A Turbo code can be naturally partitioned into two super-codes, or constituent codes. In a traditional Turbo decoder, where the extrinsic messages are exchanged between two super-codes, the Flex-SISO module cannot be directly applied, because the Flex-SISO module requires the APP values, rather than the extrinsic values, to be exchanged between super-codes. In this section, we make a small modification to the traditional Turbo decoding flow so that the APP values are exchanged in the decoding procedure.
Review of the Traditional Turbo Decoder Structure
The traditional Turbo decoding procedure with two SISO decoders is shown in Fig. 7 . The definitions of the symbols in the figure are as follows. The information bit and the parity bits at time k are denoted as u k and ( p
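The a priori LLR, the extrinsic LLR, and the APP LLR for $u_k$ are denoted as $\lambda_a(u_k)$, $\lambda_e(u_k)$, and $\lambda_o(u_k)$, respectively. In the decoding process, the SISO decoder computes the extrinsic LLR value at time k as follows:

$$\lambda_e(u_k) = \underset{(s_{k-1}, s_k):\, u_k = 1}{\max{}^{*}} \left\{ \alpha_{k-1}(s_{k-1}) + \gamma^{e}_{k}(s_{k-1}, s_k) + \beta_k(s_k) \right\} - \underset{(s_{k-1}, s_k):\, u_k = 0}{\max{}^{*}} \left\{ \alpha_{k-1}(s_{k-1}) + \gamma^{e}_{k}(s_{k-1}, s_k) + \beta_k(s_k) \right\}. \tag{9}$$

The α and β metrics are computed based on the forward and backward recursions:

$$\alpha_k(s_k) = \underset{s_{k-1}}{\max{}^{*}} \left\{ \alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) \right\}, \tag{10}$$

$$\beta_k(s_k) = \underset{s_{k+1}}{\max{}^{*}} \left\{ \beta_{k+1}(s_{k+1}) + \gamma_{k+1}(s_k, s_{k+1}) \right\}, \tag{11}$$

where the branch metric $\gamma_k$ is computed as:

$$\gamma_k = u_k \left( \lambda_c(u_k) + \lambda_a(u_k) \right) + \sum_{i} p_k^{(i)} \lambda_c(p_k^{(i)}), \tag{12}$$

where the sum runs over the parity bits $p_k^{(i)}$ at time k. The extrinsic branch metric $\gamma^{e}_{k}$ in Eq. 9 is computed as:

$$\gamma^{e}_{k} = \sum_{i} p_k^{(i)} \lambda_c(p_k^{(i)}). \tag{13}$$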
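The $\max^*(\cdot)$ function in Eqs. 9-11 is defined as:

$$\max{}^{*}(a, b) = \max(a, b) + \log\left(1 + e^{-|a - b|}\right). \tag{14}$$

The soft APP value for $u_k$ is generated as:

$$\lambda_o(u_k) = \lambda_e(u_k) + \lambda_a(u_k) + \lambda_c(u_k). \tag{15}$$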
In the first half iteration, SISO decoder 1 computes the extrinsic value $\lambda^1_e(u_k)$ and passes it to SISO decoder 2. Thus, the extrinsic value computed by SISO decoder 1 becomes the a priori value $\lambda^2_a(u_k)$ for SISO decoder 2 in the second half iteration. The computation is repeated in each iteration. The iterative process is usually terminated after a certain number of iterations, when the soft APP value $\lambda_o(u_k)$ converges.
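The max* operation used throughout the recursions is the Jacobian logarithm, that is, log(e^a + e^b) written as a maximum plus a bounded correction term. The short Python sketch below checks this identity numerically; the function names are chosen for this illustration only.

```python
import math

def max_star(a, b):
    """Jacobian logarithm: log(exp(a) + exp(b)) as max plus a correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_n(values):
    """Apply max* pairwise over a list, e.g. over all branches into a state."""
    acc = values[0]
    for v in values[1:]:
        acc = max_star(acc, v)
    return acc

metrics = [0.5, -1.0, 3.0]
exact = math.log(sum(math.exp(v) for v in metrics))
print(max_star_n(metrics), exact)        # the two values agree up to rounding
print(max(metrics))                      # max-log approximation drops the correction
```

Dropping the correction term gives the max-log-MAP approximation, which is simpler but always underestimates the exact value.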
Modified Turbo Decoder Structure Using Flex-SISO Modules
In order to use the proposed Flex-SISO module for Turbo decoding, we modify the traditional Turbo decoder structure. Figure 8 shows the modified Turbo decoder structure based on the Flex-SISO modules.
It should be noted that the modified Turbo decoding flow is mathematically equivalent to the original Turbo decoding flow, but uses a different message passing method. The modified data flow is as follows. In the first half iteration, Flex-SISO decoder 1 receives soft LLR value λ
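To relate to the traditional Turbo decoder structure, this temporary message is mathematically equal to the sum of the channel value $\lambda_c(u_k)$ and the a priori value $\lambda_a(u_k)$ in Fig. 7:

$$\lambda_t(u_k) = \lambda_c(u_k) + \lambda_a(u_k). \tag{17}$$

Thus, the branch metric calculation in Eq. 12 can be rewritten as:

$$\gamma_k = u_k \lambda_t(u_k) + \sum_{i} p_k^{(i)} \lambda_c(p_k^{(i)}). \tag{18}$$

The extrinsic branch metric ($\gamma^{e}_{k}$) calculation and the extrinsic LLR ($\lambda_e(u_k)$) calculation, however, remain the same as Eqs. 13 and 9-11, respectively. Finally, the soft APP LLR output is computed as:

$$\lambda_o(u_k) = \lambda_e(u_k; \text{new}) + \lambda_t(u_k). \tag{19}$$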
In the Flex-SISO based iterative decoding procedure, the soft outputs $\lambda^1_o(u)$ from Flex-SISO decoder 1 become the soft inputs $\lambda^2_i(u)$ for Flex-SISO decoder 2 in the second half iteration. The computation is repeated in each half-iteration until the iteration converges. Since the operations are identical between the two sub-iterations, only one physical Flex-SISO module is instantiated, and it is re-used for both sub-iterations.

Figure 9 shows an iterative Turbo decoder architecture based on the Flex-SISO module. The architecture is very similar to the LDPC decoder architecture shown in Fig. 6. The main differences are: 1) the Turbo decoder has separate parity channel LLR inputs whereas the LDPC decoder treats parity bits as information bits, 2) the Turbo decoder employs the MAP algorithm on an N-state trellis whereas the LDPC decoder applies the MAP algorithm on z independent two-state trellises, and 3) the interleaver/permuter structures are different (not shown in the figures). Despite these differences, there are important commonalities. The message passing flows are the same. The memory organizations are similar, with sizes that depend on the codeword length. The MAP processors, which will be described in the next section, have similar functional unit resources that can be configured using multiplexors for each algorithm. Thus, it is natural to design a unified SISO decoder with configurable MAP processors to support both LDPC and Turbo codes.
Design of a Flexible Functional Unit
The MAP processor is the main processing unit in both LDPC and Turbo decoders as depicted in Fig. 6 and Fig. 9 . In this section, we introduce a flexible functional unit to decode LDPC and Turbo codes with a small additional overhead.
MAP Functional Unit for Turbo Codes
In a Turbo MAP processor, the critical path lies in the state metric calculation unit, which is often referred to as the add-compare-select-add (ACSA) unit. As depicted in Fig. 10, for each state m of the trellis, the decoder needs to perform an ACSA operation as follows:

$$\alpha(m) = \max{}^{*}(\alpha_0 + \gamma_0,\ \alpha_1 + \gamma_1), \tag{20}$$

where $\alpha_0$ and $\alpha_1$ are the previous state metrics, and $\gamma_0$ and $\gamma_1$ are the branch metrics. Figure 10b shows a circuit implementation for the ACSA unit, where a signed-input look-up table "LUT-S" is used to implement the non-linear function $\log(1 + e^{-|x|})$. This circuit can be used to recursively compute the forward and backward state metrics based on Eqs. 10 and 11.
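The four steps of the ACSA operation can be mirrored in a small behavioural Python model. This is only a software sketch, not the circuit of Fig. 10b: the function names are hypothetical, and the `correction` argument stands in for the LUT-S block (here computed exactly in floating point).

```python
import math

def exact_correction(x):
    """Non-linear term log(1 + exp(-|x|)) that a look-up table would approximate."""
    return math.log1p(math.exp(-abs(x)))

def acsa(alpha0, gamma0, alpha1, gamma1, correction=exact_correction):
    s0 = alpha0 + gamma0                 # add
    s1 = alpha1 + gamma1                 # add
    diff = s0 - s1                       # compare
    selected = s0 if diff >= 0 else s1   # select
    return selected + correction(diff)   # add the correction term

new_metric = acsa(0.3, 1.2, -0.5, 2.4)
reference = math.log(math.exp(0.3 + 1.2) + math.exp(-0.5 + 2.4))
max_log = acsa(0.3, 1.2, -0.5, 2.4, correction=lambda x: 0.0)
print(new_metric, reference, max_log)
```

Passing a zero correction reduces the unit to a plain add-compare-select, which corresponds to the max-log approximation.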
MAP Functional Unit for LDPC Codes
In the layered QC-LDPC decoding algorithm, each super-code comprises z independent single parity check codes. Each single parity check code can be viewed as a terminated two-state convolutional code. Figure 11 shows an example of the trellis structure for a single parity check node.
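An efficient MAP decoding algorithm for a single parity check code was given in [16]: for independent random variables $u_0, u_1, \ldots, u_l$, the extrinsic LLR value for bit $u_k$ is computed as:

$$\lambda_e(u_k) = \underset{\sim\{u_k\}}{\boxplus}\; \lambda_i(u_j), \tag{21}$$

where the compact notation $\sim\{u_k\}$ represents the set of all the variables with $u_k$ excluded. For brevity, we define a function f(a, b) to represent the operation ⊞:

$$f(a, b) \triangleq \log\frac{1 + e^{a} e^{b}}{e^{a} + e^{b}} = \mathrm{sign}(a)\,\mathrm{sign}(b) \left[ \min(|a|, |b|) + \log\left(1 + e^{-(|a| + |b|)}\right) - \log\left(1 + e^{-\left||a| - |b|\right|}\right) \right], \tag{22}$$

where $a \triangleq \lambda_i(u_1)$ and $b \triangleq \lambda_i(u_2)$. Figure 12 shows a forward-backward decoding flow to implement Eq. 21.

Figure 12 A forward-backward decoding flow to compute the extrinsic LLRs for single parity check code (forward recursion: $\alpha_{k+1} = f(\alpha_k, \gamma_k)$).

The forward (α) and backward (β) recursions are defined as:

$$\alpha_{k+1} = f(\alpha_k, \gamma_k), \tag{23}$$

$$\beta_{k-1} = f(\beta_k, \gamma_k), \tag{24}$$

where $\gamma_k = \lambda_i(u_k)$ and is referred to as the branch metric as an analogy to a Turbo decoder. The α and β metrics are initialized to +∞ in the beginning. Based on the α and β metrics, the extrinsic LLR for $u_k$ is computed as:

$$\lambda_e(u_k) = f(\alpha_k, \beta_k). \tag{25}$$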
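The forward-backward flow for a single parity check code can be checked numerically with the Python sketch below. It is an illustration under stated assumptions: `f` is written in the sign-min form with two correction terms, the index convention (alpha[k] covers the bits before position k, beta covers the bits after it) is the one chosen for this sketch, and the check is assumed to have degree at least two. The result is compared against the classical tanh rule.

```python
import math

def f(a, b):
    """Pairwise parity LLR: sign * (min + correction terms); f(inf, b) == b."""
    magnitude = (min(abs(a), abs(b))
                 + math.log1p(math.exp(-(abs(a) + abs(b))))
                 - math.log1p(math.exp(-abs(abs(a) - abs(b)))))
    return math.copysign(1.0, a) * math.copysign(1.0, b) * magnitude

def spc_extrinsic(gamma):
    """Extrinsic LLRs of one single parity check via forward-backward recursions."""
    n = len(gamma)
    alpha = [math.inf] * (n + 1)          # alpha[k]: combination of gamma[0..k-1]
    for k in range(n):                    # forward recursion
        alpha[k + 1] = f(alpha[k], gamma[k])
    beta = math.inf                       # combination of gamma[k+1..n-1]
    ext = [0.0] * n
    for k in range(n - 1, -1, -1):        # backward recursion and output
        ext[k] = f(alpha[k], beta)
        beta = f(beta, gamma[k])
    return ext

def tanh_rule(gamma):
    """Reference: classical tanh-domain check node update."""
    out = []
    for k in range(len(gamma)):
        prod = 1.0
        for j, g in enumerate(gamma):
            if j != k:
                prod *= math.tanh(g / 2.0)
        out.append(2.0 * math.atanh(prod))
    return out

gamma = [1.5, -0.7, 2.2, 0.4]
ext = spc_extrinsic(gamma)
ref = tanh_rule(gamma)
print(ext)
print(ref)
```

The forward-backward flow needs about 3n evaluations of f for a check of degree n, instead of roughly n² when every extrinsic value is computed from scratch.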
Compared to the classical "tanh" function used in LDPC decoding, $\Psi(x) = -\log(\tanh(|x/2|))$, the $f(\cdot)$ function is numerically more robust and less sensitive to quantization noise. Due to its wide dynamic range (up to +∞), the $\Psi(x)$ function has a high complexity and is prone to quantization noise. Although many approximations have been proposed to improve the numerical accuracy of $\Psi(x)$ [26, 29, 48], it is still expensive to implement the $\Psi(x)$ function in hardware. However, the non-linear term in the $f(\cdot)$ function has a very small dynamic range:

$$0 < g(x) \triangleq \log\left(1 + e^{-|x|}\right) \le \log 2 \approx 0.6931, \tag{26}$$

thus the $f(\cdot)$ function is easier to implement in hardware by using a low-complexity look-up table (LUT). To implement g(x) in hardware, we propose to use a four-value LUT approximation which is shown in Table 1. For fixed point implementation, we propose to use the Q.2 quantization scheme (Q total bits with 2 fractional bits). Table 2 shows the proposed LUT implementation for Q.2 quantization. It should be noted that g(x) is the same as the non-linear term in the Turbo $\max^*(\cdot)$ function (cf. Eq. 14). Thus, the same look-up table configuration can be applied to the Turbo ACSA unit. In Section 4.4, we will show the decoding performance obtained by using this look-up table. Figure 14 depicts a circuit implementation for the LDPC $|f(a, b)|$ functional unit using two look-up tables "LUT-S" and "LUT-U", where LUT-S and LUT-U implement $\log(1 + e^{-\left||a| - |b|\right|})$ and $\log(1 + e^{-(|a| + |b|)})$, respectively. The difference between LUT-S and LUT-U is that LUT-S is a signed-input look-up table that takes both positive and negative data inputs, whereas LUT-U is an unsigned-input look-up table (half the size of LUT-S) that only takes positive data inputs.
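To give a feel for how small such a table is, the Python sketch below derives a four-value table by rounding the non-linear term log(1 + e^-|x|) to 2 fractional bits. This is an assumption made for illustration: the entries come from simple rounding and are not claimed to match Table 1 or Table 2, and all names are hypothetical.

```python
import math

FRAC_BITS = 2                 # Q.2: two fractional bits, step size 0.25
SCALE = 1 << FRAC_BITS

def g(x):
    """Exact non-linear term log(1 + exp(-|x|))."""
    return math.log1p(math.exp(-abs(x)))

def build_lut(max_index=16):
    """Quantized g() for inputs 0, 0.25, 0.5, ... stored as integers in Q.2."""
    return [round(g(i / SCALE) * SCALE) for i in range(max_index)]

LUT = build_lut()

def g_fixed(x_int):
    """Look up the correction for a Q.2 integer input; zero beyond the table."""
    i = abs(x_int)
    return LUT[i] if i < len(LUT) else 0

def max_star_fixed(a_int, b_int):
    """Fixed-point max* built from a compare, a select and one table look-up."""
    return max(a_int, b_int) + g_fixed(a_int - b_int)

print(LUT)
print(sorted(set(LUT)))            # only four distinct output values remain
print(max_star_fixed(8, 8) / SCALE)
```

With this rounding only the values 0, 0.25, 0.5 and 0.75 occur, so each table entry fits in two bits.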
Proposed Flexible Functional Unit (FFU)
If we compare the LDPC $|f(a, b)|$ functional unit (cf. Fig. 14) with the Turbo ACSA functional unit (cf. Fig. 10), we can see that they have many commonalities except for the position of the look-up tables and the multiplexor. To support both LDPC and Turbo codes with minimum hardware overhead, we propose a flexible functional unit (FFU) which is depicted in Fig. 15. We modify the look-up table structure so that each look-up table can be bypassed when the bypass control signal is high. A select signal is used to switch between the LDPC mode and the Turbo mode. The functionality of the proposed FFU architecture is summarized in Table 3. The word lengths for X, Y, V, and W are all 9 bits. To evaluate the area efficiency of the proposed FFU, we have described the LDPC f(a, b) functional unit, the Turbo ACSA functional unit, and the proposed FFU, and synthesized them on a TSMC 90 nm CMOS technology. The maximum achievable frequency (assuming no clock skews) and the synthesized area at two frequencies (400 and 800 MHz) are summarized in Table 4. As can be seen, the proposed FFU has only about 15% area and timing overhead compared to the dedicated functional units. The area efficiency is achieved because many logic gates can be shared between the LDPC and Turbo modes.
Fixed Point Decoding Performance
To evaluate the fixed-point decoding performance using the look-up table based FFU, we perform floating-point and bit-accurate fixed-point simulations for LDPC and Turbo codes using BPSK modulation over an AWGN channel. As a good trade-off between complexity and performance, we use a 6.2 quantization scheme for the channel LLR inputs of the fixed-point LDPC and Turbo decoders.

Figure 16 shows the bit error rate (BER) simulation result for a WiMAX LDPC code with code rate 1/2 and code length 2,304. The maximum number of iterations is 15. As can be seen from Fig. 16, the fixed-point FFU solution has a very small performance degradation (< 0.05 dB) at a BER level of $10^{-6}$ compared to the floating-point solution. We also plot a BER curve for the scaled min-sum solution [11], which is a sub-optimal approximation algorithm that does not use the look-up tables. As can be seen from the figure, the look-up table based FFU solution delivers a better decoding performance than the scaled min-sum solution. The complexity of adding the look-up tables is relatively small because the word length of the data in the look-up table is only 2 bits. Figure 17 compares the convergence speed of the layered decoding algorithm with that of the standard two-phase decoding algorithm.

Figure 18 shows the BER simulation result for 3GPP-LTE Turbo codes with block sizes of 6,144, 1,024, 240, and 40. The maximum number of Turbo iterations is 6 (12 half iterations). The sliding window length is 32. As can be seen from the figure, the FFU based fixed-point decoder has almost no performance loss compared to the floating-point case. The proposed FFU solution delivers a better decoding performance than the sub-optimal max-log-MAP solution.
From these simulation results, we conclude that the proposed look-up table based FFU is a good solution for supporting high performance LDPC and Turbo decoding requirements.
Design of a Flexible SISO Decoder
Built on top of the FFU arithmetic unit, we introduce a flexible SISO decoder architecture to handle LDPC and Turbo codes. Figure 19 illustrates the proposed dual-mode SISO decoder architecture. The decoder comprises four major functional units: alpha unit (α), beta unit (β), extrinsic-1 unit, and extrinsic-2 unit. The decoder can be reconfigured to process: i) an eight-state convolutional Turbo code, or ii) 8 single parity check codes.
Turbo Mode
In the Turbo mode, all the elements in the Flex-SISO decoder are activated. For Turbo decoding, we use the Next Iteration Initialization (NII) sliding window algorithm as suggested in [1, 19]. The NII approach avoids the calculation of training sequences as initialization values for the β state metrics; instead, the boundary metrics are initialized from the previous iteration. As a result, the decoding latency is smaller than that of the traditional sliding window algorithm, which requires a calculation of training sequences [25, 43], and only one β unit is required. Moreover, this solution is very suitable for high code-rate Turbo codes, which require a very long training sequence to obtain reliable boundary state metrics. Note that this scheme requires an additional memory to store the boundary state metrics. A dataflow graph for the NII sliding window algorithm is depicted in Fig. 20, where the X-axis represents the trellis flow and the Y-axis represents the decoding time, so that a box represents the processing of a block of L data in L time steps, where L is the sliding window size. In the decoding process, the α metrics are computed in the natural order whereas the β metrics and the extrinsic LLR ($\lambda_e$) are computed in the reverse order. By using multiple FFUs, the α and β units are able to compute the state metrics in parallel, leading to real-time decoding with a latency of L.
The decoder works as follows. The decoder uses the soft LLR value $\lambda_i(u)$ and the old extrinsic value $\lambda_e(u; \text{old})$ to compute $\lambda_t(u)$ based on Eq. 16. A branch metric calculation (BMC) unit is used to compute the branch metrics $\gamma(u, p)$ based on Eq. 18, where u, p ∈ {0, 1}. Then the branch metrics are buffered in a γ stack for the backward (β) metric calculation. The α and β metrics are computed using Eqs. 10 and 11. The boundary β metrics are initialized from an NII buffer (not shown in Fig. 19). A dispatcher unit is used to dispatch the data to the correct FFUs in the α/β unit. Each α/β unit has eight fully-parallel FFUs, so the eight-state convolutional trellis can be processed at a rate of one stage per clock cycle.
To compute the extrinsic LLR as defined in Eq. 9, we first add the β metrics to the extrinsic branch metrics $\gamma^e(p)$, where $\gamma^e(p)$ is retrieved from the γ stack, as $\gamma^e(p) = \gamma(0, p)$.
The extrinsic LLR calculation is separated into two phases, as shown in the right part of Fig. 19. In phase 1, the extrinsic-1 unit performs eight ACSA operations in parallel using eight FFUs. In phase 2, the extrinsic-2 unit performs six $\max^*(a, b)$ operations and one subtraction. Finally, the soft LLR $\lambda_o(u)$ is obtained by adding $\lambda_e(u; \text{new})$ to $\lambda_t(u)$, where $\lambda_t(u)$ is also retrieved from the γ stack, as $\lambda_t(u) = \gamma(1, 0)$.
LDPC Mode
In the LDPC mode, a substantial subset (more than 90%) of the logic gates will be reused from the Turbo mode. As shown in Fig. 21 , three major functional units (α unit, β unit, and the extrinsic-1 unit) and two stack memories are reused in the LDPC mode. The extrinsic-2 unit will be de-activated in the LDPC mode.
The decoder can process eight single parity check codes in parallel because each of the α unit, β unit, and extrinsic-1 unit has eight parallel FFUs. The dataflow graph of the LDPC decoding (cf. Fig. 12) is very similar to that of the Turbo decoding (cf. Fig. 20). The decoder works as follows. The decoder first computes $\lambda_t(u)$ based on Eq. 5. In the LDPC mode, the branch metric γ is equal to $\lambda_t(u)$. Prior to decoding, the α and β metrics are initialized to the maximum value. Assume the check node degree is L. In the first L cycles, the α unit recursively computes the α metrics in the forward direction and stores them in an α stack. In the next L cycles, the β unit recursively computes the β metrics in the backward direction. At the same time, the extrinsic-1 unit computes the extrinsic LLRs using the α and β metrics. While the β unit and the extrinsic-1 unit are working on the first data stream, the α unit can work on the second stream, which leads to a pipelined implementation.
Performance
The proposed Flex-SISO decoder has been synthesized on a TSMC 90 nm CMOS technology. For high throughput applications, it is necessary to use multiple SISO decoders working in parallel to increase the decoding speed. For parallel Turbo decoding, multiple SISO decoders can be employed by dividing a codeword block into several sub-blocks, each of which is processed separately by a dedicated SISO decoder [7, 20, 30, 41, 42]. For LDPC decoding, the decoder parallelism can be achieved by employing multiple check node processors [10, 14, 32, 40, 49]. Based on the Flex-SISO decoder core, we propose a parallel LDPC/Turbo decoder architecture which is shown in Fig. 22. As depicted, the parallel decoder comprises P Flex-SISO decoder cores. In this architecture, there are three types of storage. The extrinsic memory (Ext-Mem) is used for storing the extrinsic LLR values produced by each SISO core. The APP memory (APP-Mem) is used to store the initial and updated LLR values. The APP memory is partitioned into multiple banks to allow parallel data transfer. The Turbo parity memory is used to store the channel LLR values for each parity bit in a Turbo codeword. This memory is not used for LDPC decoding (parity bits are treated as information bits for LDPC decoding). Two permuters are used to perform the permutation of the APP values back and forth.
As a case study, we have designed a high-throughput, flexible LDPC/Turbo decoder to support the following three codes: 1) 802.16e WiMAX LDPC code, 2) 802.11n WLAN LDPC code, and 3) 3GPP-LTE Turbo code. Table 6 summarizes the performance and design parameters for this decoder. The number of the Flex-SISO decoders is chosen to be 12.
For LDPC decoding, with 12 available Flex-SISO cores, the decoder can process up to 12 × 8 = 96 check nodes simultaneously. Because the sub-matrix size z ranges from 24 to 96 for 802.16e LDPC codes and from 27 to 81 for 802.11n LDPC codes, the proposed decoder guarantees that all z check nodes within a layer can be processed in parallel.
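The capacity argument can be checked with a few lines of Python. The sketch assumes the sub-matrix sizes defined by the two standards (z = 24, 28, ..., 96 for 802.16e and z = 27, 54, 81 for 802.11n); the helper name `cores_needed` is hypothetical.

```python
FFUS_PER_CORE = 8   # parallel FFUs per alpha/beta/extrinsic-1 unit
NUM_CORES = 12      # Flex-SISO cores in the case-study decoder

def cores_needed(z):
    """Number of Flex-SISO cores required to process z check nodes at once."""
    return -(-z // FFUS_PER_CORE)  # ceiling division

# Sub-matrix sizes assumed from the standards (not listed in the text).
WIMAX_Z = list(range(24, 97, 4))   # 802.16e
WLAN_Z = [27, 54, 81]              # 802.11n

if __name__ == "__main__":
    for z in WIMAX_Z + WLAN_Z:
        print(z, cores_needed(z), cores_needed(z) <= NUM_CORES)
```

Only the largest 802.16e sub-matrix size (z = 96) uses all 12 cores; for smaller z, some cores or FFUs remain idle within a layer.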
For 3GPP-LTE Turbo decoding, the codeword can be partitioned into M sub-blocks for parallel processing. The LTE Turbo code uses a quadratic permutation polynomial (QPP) interleaver [36] , which allows conflict-free memory access as long as M is a factor of the codeword length. There are 188 different codeword sizes defined in LTE. All of them support a parallelism level of 8, and some of them support a parallelism level of 10 or 12.
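The conflict-free property can be illustrated with a short Python check. The sketch assumes the usual QPP form π(i) = (f1·i + f2·i²) mod K and a memory split into M banks of K/M consecutive addresses. The coefficients f1 = 3, f2 = 10 for K = 40 are quoted from memory of the LTE interleaver table and should be verified against the specification; the function names are hypothetical.

```python
def qpp(i, K, f1, f2):
    """QPP interleaver address for index i."""
    return (f1 * i + f2 * i * i) % K

def is_conflict_free(K, f1, f2, M):
    """True if M parallel SISO cores never address the same bank in one cycle."""
    if K % M != 0:
        raise ValueError("M must be a factor of the codeword length K")
    W = K // M  # sub-block (window) length handled by each core
    for i in range(W):
        # In cycle i, core j reads interleaved address qpp(i + j*W).
        banks = {qpp(i + j * W, K, f1, f2) // W for j in range(M)}
        if len(banks) != M:
            return False
    return True

if __name__ == "__main__":
    K, f1, f2 = 40, 3, 10  # assumed LTE parameters for the smallest block
    for M in (8, 10):
        print(M, is_conflict_free(K, f1, f2, M))
```

The check works because π(i + jW) and π(i + kW) are congruent modulo W, so two cores always read the same offset in different banks.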
Because 12 Flex-SISO cores are available, we dynamically allocate the maximum possible number of Flex-SISO cores (8 ≤ M ≤ 12), subject to the parallelism supported by the QPP interleaver. As an example, for the maximum codeword size of 6144, we can allocate all 12 Flex-SISO cores to work in parallel. It should be noted that the parallelism level has some impact on the error performance of the decoder due to the edge effects caused by the sub-block partitioning [17] .
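A possible allocation rule is sketched below in Python. It assumes the LTE block-size grid (40 to 512 in steps of 8, 528 to 1024 in steps of 16, 1056 to 2048 in steps of 32, and 2112 to 6144 in steps of 64) and tries only the parallelism levels named in the text (12, 10, 8); both the rule and the function names are illustrative assumptions, not the allocation logic of the implemented decoder.

```python
def lte_block_sizes():
    """The LTE Turbo code block sizes (assumed grid, see lead-in)."""
    sizes = list(range(40, 513, 8))
    sizes += range(528, 1025, 16)
    sizes += range(1056, 2049, 32)
    sizes += range(2112, 6145, 64)
    return sizes

def allocate_cores(K, levels=(12, 10, 8)):
    """Largest listed parallelism level M that is a factor of K."""
    for M in levels:
        if K % M == 0:
            return M
    raise ValueError("no supported parallelism level divides K")

if __name__ == "__main__":
    sizes = lte_block_sizes()
    print(len(sizes))                      # number of block sizes
    print(allocate_cores(6144))            # maximum codeword size
    print(all(K % 8 == 0 for K in sizes))  # level 8 is always available
```

With this rule, every block size obtains at least 8 cores, and the maximum size of 6144 is split into 12 sub-blocks of 512 bits each.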
This parallel and flexible decoder has been implemented in Verilog HDL and synthesized on a TSMC 90 nm CMOS technology using Synopsys Design Compiler. The maximum clock frequency of this decoder is 500 MHz. The synthesized core area is 3.2 mm², which includes all of the components in this decoder. Table 6 summarizes the features of this decoder. The decoder can be configured to support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Compared to a dedicated LDPC decoder solution [37] , this flexible decoder has only about 15-20% area overhead when normalized to the same throughput target (with the same number of iterations). Compared to a dedicated Turbo decoder solution [30] , our flexible decoder shows only about 10-20% area overhead when normalized to the same technology and the same throughput and code length.
Related Work and Architecture Comparison
Multi-mode Turbo decoders are an increasingly important component in mobile wireless devices. To support multi-mode decoding, ASIC, ASIP, MPSoC, and SIMD architectures have recently been proposed [2, 28, 34] . In [2] , a reconfigurable application-specific instruction-set processor (ASIP) architecture is presented for convolutional, Turbo, and LDPC code decoding. In [34] , a multiprocessor system-on-chip (MPSoC) architecture is described for LDPC and Turbo code decoding. In [28] , a SIMD-like processor architecture is proposed for Viterbi, Turbo, Reed-Solomon, and LDPC decoding. Table 7 shows the architecture comparison and tradeoff analysis of these decoders. Each approach offers different benefits in terms of flexibility. Our focus is to achieve the highest throughput for both LDPC and Turbo codes. As can be seen from the table, the proposed decoder can support very high throughput LDPC/Turbo decoding at a small silicon area cost.
Conclusion
In this work, we present a flexible decoder architecture to support LDPC and Turbo codes. We propose a dual-mode Flex-SISO decoder as a basic building block in LDPC and Turbo decoders. Our study has focused on the design and implementation of the Flex-SISO decoder architecture. We unify the decoding process for LDPC and Turbo codes so that the same Flex-SISO decoder can be reused for both cases, resulting in more than 80% resource sharing. To increase decoding throughput, we propose a parallel LDPC/Turbo decoder using multiple Flex-SISO cores. With a core area of 3.2 mm², the decoder is able to sustain 600 Mbps 802.16e LDPC decoding, 500 Mbps 802.11n LDPC decoding, or 450 Mbps 3GPP LTE Turbo decoding. The proposed architecture can significantly reduce the cost of a multi-mode receiver.
