Abstract-The current convergence process in wireless technologies demands strong efforts in the design of highly flexible and interoperable equipment. This contribution focuses on one of the most important baseband processing units in wireless receivers, the forward error correction unit, and proposes a Network-on-Chip (NoC) based approach to the design of multi-standard decoders. High-level modeling is exploited to drive the NoC optimization for a given set of both turbo and Low-Density Parity-Check (LDPC) codes to be supported. Moreover, synthesis results show that the proposed approach can offer a fully compliant WiMAX decoder, supporting the whole set of turbo and LDPC codes with higher throughput and an occupied area comparable to or lower than previously reported flexible implementations. In particular, the mentioned design case achieves a worst-case throughput higher than 70 Mb/s at the area cost of 3.17 mm² on a 90 nm CMOS technology.
Index Terms-VLSI, LDPC Decoder, NoC, Flexibility, Wireless communications
I. INTRODUCTION
Wireless communications employ high-performance forward error correction codes such as turbo [1] and Low-Density Parity-Check (LDPC) [2] codes to achieve reliable transmission. The excellent error correction performance of turbo and LDPC codes is obtained at the expense of significant complexity at the decoder side. Even if the implementation of turbo and LDPC code decoders is a well studied problem in the literature, two critical needs have emerged in recent years: i) achieving high throughput, ii) granting flexibility and interoperability. The intrinsic differences between the turbo and LDPC decoding algorithms and their iterative nature make the design of high-throughput, flexible turbo/LDPC decoder architectures a challenging task.
In both turbo and LDPC decoders high throughput is routinely achieved by employing parallel architectures [3], [4], where several processing elements (PEs) perform the decoding algorithm concurrently on different portions of the received frame. However, PEs require a large bandwidth and an efficient interconnection structure to concurrently read/write data from/to the memory. High-throughput PEs able to support both turbo and LDPC decoding [5]-[9] can be implemented as Application-Specific Integrated Circuits (ASICs) or Application-Specific Instruction-set Processors (ASIPs). In general, ASIC solutions achieve higher throughput with lower complexity than ASIP implementations. However, ASIP architectures are usually more flexible than ASIC ones.
Stemming from the general Network-on-Chip (NoC) paradigm [10], Neeb et al. [11] proposed an interesting NoC-based approach to enable flexible and efficient interconnection among the processing elements in parallel turbo decoder architectures. According to [12] this approach, where the network structure is used to connect PEs belonging to the same Intellectual Property (IP), is referred to as intra-IP NoC. In [13] the intra-IP NoC approach is studied in the context of parallel turbo decoder architectures, investigating a number of direct and indirect networks. A similar approach has been employed for LDPC decoder architectures [12], [14]. A few recent works [7], [9], [15] tried to exploit the intra-IP NoC approach to design flexible turbo/LDPC decoder architectures. However, from these works it is not clear how the design of the PEs and the design of the network influence each other. Moreover, most of these implementations are not fully compliant with the high throughput requirements of modern wireless standards.
This paper exploits the Turbo NoC cycle-accurate simulation tool [16] described in [17], where an extensive analysis of the performance achieved by various NoC topologies in the context of turbo decoder architectures is shown. The contribution of this paper is twofold: i) to propose a competitive intra-IP-NoC-based ASIC architecture for flexible turbo/LDPC decoding with a clear design flow; ii) to prove that the flexibility achieved via the intra-IP-NoC-based approach has a limited impact on the area of the decoder architecture. As a case study the WiMAX standard is considered. The architecture has complexity comparable to the latest state-of-the-art flexible turbo/LDPC decoders, together with a higher worst-case throughput, and guarantees multi-standard compliance and low power consumption.
The paper is organized as follows: in Section II turbo and LDPC decoding algorithms are summarized, whereas Sections III and IV deal with the architectures of the NoC interconnection structure and the PEs respectively. Experimental results for the WiMAX standard are shown in Section V and conclusions are drawn in Section VI.
II. DECODING ALGORITHMS
Algorithms used to decode turbo and LDPC codes are both iterative and based on data processing and message passing phases. The two phases can be partially overlapped employing pipeline architectures to increase the throughput. In the following paragraphs these algorithms are summarized.
A. Turbo code decoding algorithm
Convolutional turbo codes rely on the parallel concatenation of two constituent convolutional codes (CC) by means of an interleaver. The decoding algorithm is based on the iterative exchange of extrinsic information between the two constituent decoders, usually referred to as Soft-In-Soft-Out (SISO) units. Extrinsic information is exchanged according to the order imposed by the permutation law defined by the interleaver. A SISO produces, at each step $k$, the extrinsic information of the corresponding uncoded symbol $u$, that is
is the extrinsic information produced during the previous half iteration, whereas the a-posteriori information λ where
is the systematic component of the intrinsic information, $\tilde{u} \in U$ is an uncoded symbol taken as a reference (usually $\tilde{u} = 0$) and $u \in U \setminus \{\tilde{u}\}$ with $U$ the set of uncoded symbols; $e$ is a trellis transition and $u(e)$ is the corresponding uncoded symbol. The $\max^*\{x_i\}$ function is implemented as $\max\{x_i\}$ followed by a correction term often stored in a small Look-Up-Table (LUT) [19]. The correction term, usually adopted when decoding binary codes (Log-MAP), can be omitted for double-binary turbo codes with minor Bit-Error-Rate (BER) performance degradation (Max-Log-MAP). The term $b(e)$ in (1) is defined as:
where $s^S(e)$ and $s^E(e)$ are the starting and the ending states of $e$, $\alpha_k[s^S(e)]$ and $\beta_k[s^E(e)]$ are the forward and backward metrics associated with $s^S(e)$ and $s^E(e)$ respectively. The term $\lambda_k[c(e)]$ is the intrinsic information received from the channel.
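The difference between the Log-MAP and Max-Log-MAP flavours of the max* operator can be illustrated with a minimal Python sketch. It assumes the usual Jacobian-logarithm form of the correction term, ln(1 + e^(-|a-b|)); the function names are illustrative only, and a hardware implementation would read the correction from a small LUT rather than evaluate it.

```python
import math
from functools import reduce

def max_star(a: float, b: float) -> float:
    """Log-MAP flavour: max plus the Jacobian-logarithm correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_log(a: float, b: float) -> float:
    """Max-Log-MAP flavour: the correction term is simply dropped."""
    return max(a, b)

def max_star_all(values, op=max_star):
    """Apply a two-input operator pairwise over a sequence of metrics."""
    return reduce(op, values)

if __name__ == "__main__":
    metrics = [0.3, -1.2, 2.5]
    print(max_star_all(metrics))           # exact log-sum-exp of the metrics
    print(max_star_all(metrics, max_log))  # plain maximum, never larger
```

Applied pairwise, the corrected operator equals the logarithm of the sum of exponentials, so the plain maximum is always a lower bound on it; the gap is largest when the two inputs are close.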
B. LDPC code decoding algorithm
LDPC codes are characterized by a very sparse $M \times N$ parity-check matrix $H$, and valid codewords $x$ satisfy $H \cdot x^T = 0$. Each code can be represented as a bipartite graph, known as Tanner Graph, containing two sets of nodes: Variable Nodes (VNs) and Check Nodes (CNs). VNs are associated with the $N$ bits of the codeword, whereas CNs correspond to the $M$ parity-check constraints. The most common algorithm to decode LDPC codes is the Belief Propagation (BP) algorithm. There are two main scheduling schemes for the BP: two-phase scheduling and layered scheduling [20]. The latter nearly doubles the convergence speed as compared to two-phase scheduling. In a layered decoder, parity-check constraints are grouped in layers, each of which is associated with a component code. Then, layers are decoded in sequence by propagating extrinsic information from one layer to the following one [20]. This process is iterated up to the desired level of reliability.
Let $\lambda[c]$ represent the LLR of symbol $c$ and, for column $k$ in $H$, bit LLR $\lambda_k[c]$ is initialized to the corresponding received soft value. Then, for all parity constraints $l$ in a given layer, the following operations are executed:
is the extrinsic information received from the previous layer and updated in (10) to be propagated to the succeeding layer. Term $R^{old}_{lk}$, pertaining to element $(l,k)$ of $H$ and initialized to 0, is used to compute (6); the same amount is then updated in (9), $R^{new}_{lk}$, and stored to be used again in the following iteration. In (7) and (8) $N(l)$ is the set of all bit indexes that are connected to parity constraint $l$.
According to [21], the $\Psi(\cdot)$ function in (7) and (9) can be simplified with a limited BER performance loss as
usually referred to as normalized-min-sum approximation, where $\tilde{\delta}_{lk} = \sigma \cdot \delta_{lk}$ and $\sigma \le 1$.
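To make the layered schedule concrete, the following Python sketch updates a single parity constraint with the normalized-min-sum rule. It follows the common textbook form of layered decoding (subtract the old check-to-bit message, take the sign product and minimum magnitude over the other bits, scale by σ, add back), which is assumed here rather than taken from (6)-(10); the function name, the data layout and the value σ = 0.75 are hypothetical.

```python
def layered_min_sum_row(llr, r_old, cols, sigma=0.75):
    """Process one parity constraint l of a layer.

    llr   : list of bit LLRs, updated in place
    r_old : dict column -> previous check-to-bit message (missing means 0)
    cols  : N(l), the columns connected to this constraint (at least two)
    Returns the new check-to-bit messages as a dict column -> value.
    """
    # Remove the contribution this constraint made in the previous iteration.
    q = {k: llr[k] - r_old.get(k, 0.0) for k in cols}
    r_new = {}
    for k in cols:
        others = [q[n] for n in cols if n != k]
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign
        magnitude = min(abs(v) for v in others)
        # Normalized min-sum: scaled sign product times minimum magnitude.
        r_new[k] = sigma * sign * magnitude
        # Updated LLR, passed on to the next layer.
        llr[k] = q[k] + r_new[k]
    return r_new

if __name__ == "__main__":
    llr = [2.0, -1.0, 3.0]
    print(layered_min_sum_row(llr, {}, [0, 1, 2]), llr)
```

In the small example the second bit starts with a negative LLR, is outvoted by the two more reliable bits in the same constraint, and ends with a positive LLR.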
In a parallel decoder, the decoding algorithms summarized in the previous paragraphs are partitioned among $P$ PEs. When configured in turbo code mode, these PEs operate as concurrent SISOs, while they execute (6) to (10) in parallel for $P$ parity-check constraints when configured in LDPC code mode. In both cases, messages are exchanged among PEs to propagate $\lambda[c]$ values in accordance with the code structure. In the following, we indicate the $j$-th message received and generated by PE $i$ as $\lambda_{i,j}$ and $\lambda'_{i,j}$ respectively.
III. DESIGN OF THE NOC ARCHITECTURE
Turbo and LDPC decoding have in common a complex message passing phase, which varies in terms of duration and intertwining with the parallelism of the decoder. In this section we study the characteristics of the NoC architectures required to support both turbo and LDPC decoding. To this purpose, we start from the results presented in [17] for NoC-based decoding of turbo codes alone and extend them to include LDPC decoding. We also assume the same node architecture detailed in [17] and shown in Fig. 1: each node in the network is made of a Routing Element (RE), a Processing Element (PE) and a memory (MEM) to store the incoming messages. The RE is based on an $F \times F$ crossbar switch with $F$ input FIFOs and $F$ output registers. $t_{i,j}$ represents the memory location where $\lambda_{i,j}$ will be stored.
In this work we concentrate on the two most promising node architectures proposed in [17]: the All-Precalculated (AP) and the Partially-Precalculated (PP) architectures. The AP architecture makes use of off-line simulations to compute the routing information of each node and to store it in a routing memory. Since the routing information is precalculated, very complex routing algorithms can be employed to compute it, which allows the depth of the input FIFOs to be reduced. Moreover, this solution does not require any kind of header in the packet structure, reducing the width of the input FIFOs. However, as pointed out in [13], the AP architecture requires additional memories to store the routing information of all supported codes. In the PP architecture the routing is performed on-line by a routing algorithm: only the $t_{i,j}$ sequences are precalculated, while destination node identifiers are included in the packet header. Details on PE architectures will be given in Section IV.
A. NoC analysis and simulation tool
The SystemC simulator developed in [17] is used here to extensively analyze the performance of NoC-based LDPC decoder architectures, in terms of throughput and memory requirements. A set of parameters is defined to take into account a large number of possible design choices, including routing algorithms, node architectures and packet structures. The simulator requires the description of the NoC topology, i.e. the number of nodes and their links; it then derives the communication pattern among the nodes of the network.
In order to evaluate the performance of a NoC-based LDPC decoder, a pre-processing tool has been developed to produce equivalent interleavers. Indeed $H$, the parity-check matrix of the LDPC code, can be transformed into a turbo-like interleaver once the decoding scheduling and the topology are chosen. The following flow has been employed to analyze the performance of NoC-based LDPC decoder architectures.
• The first step is the definition of the graph representation of the $H$ matrix. The size and structure of this graph depend on the chosen scheduling. With the layered decoding approach, the resulting graph has $M$ nodes, and an arc between row-nodes $i$ and $j$ is defined when a non-zero entry is present on the same column of both $i$ and $j$.
• The second step is the choice of the NoC topology and its degree of parallelism. To this purpose, a set T of various topologies, including mesh, toroidal mesh, spidergon, honeycomb, generalized De-Bruijn and generalized Kautz has been considered.
• The problem of mapping the LDPC codes on a specific NoC is then formulated in terms of graph partitioning and solved using the Metis bundle of graph-partitioning algorithms [22]. Once graph nodes are assigned to NoC nodes, the equivalent interleaver is constructed. The framework built around the Metis package checks the produced interleavers for minimum length and uniform message distribution, selecting the optimal one for each code-topology pair.
As a result of these analysis steps, LDPC check nodes are partitioned among the nodes of each NoC; simulation is then used to evaluate the number of cycles required to perform a decoding iteration with each NoC in T. Simulations are repeated for several values of the following parameters:
• Processing element output rate (R): the number of messages produced by a PE in a clock cycle.
• Routing algorithm: three different routing policies are embedded in [16]. They rely on the off-line computation of the shortest paths between nodes, which are stored in one or more routing tables. When only one shortest path is used (one routing table) the routing algorithm is referred to as Single-Shortest-Path (SSP), whereas when several shortest paths are computed (multiple routing tables) the algorithm is named All-local-Shortest-Paths (ASP). The first approach described in [17] is SSP-Round-Robin (SSP-RR), which is based on a circular serving policy. Similarly, the SSP-FIFO-Length (SSP-FL) routing algorithm bases its serving policy on the current status of the input FIFOs. The third approach, named ASP-FIFO-length-with-Traffic-spreading (ASP-FT), takes into account all the possible shortest paths. Its serving policy is a modified version of FL: it keeps a statistic of sent messages to spread the traffic on the network [17].
• Delay/Send Colliding Message (DCM/SCM): this parameter activates a collision management technique. A collision arises when two or more messages need to be routed to the same output port. In this case, if the DCM strategy is employed, the first message is routed according to the selected routing algorithm, whereas the colliding messages are kept in their FIFOs. On the contrary, if SCM is used, colliding messages are randomly routed to one of the available output ports. Namely, the configuration of the crossbar switch is chosen to route non-colliding messages, whereas colliding messages are treated as "don't-care".
• Route Local (RL): this flag selects whether local messages, i.e. messages sent and received by the same PE, are routed on the network (RL = 1) or are stored in an internal queue, bypassing the routing (RL = 0).
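Returning to the first step of the flow, the graph built from H under layered scheduling can be sketched in a few lines of Python. The sketch only constructs the row graph and counts the arcs cut by a given assignment of rows to NoC nodes; the partitioning itself is left to Metis in the flow described here and is not reproduced. The function names, the toy matrix and the assignment are hypothetical.

```python
from itertools import combinations

def row_graph(h):
    """Build the layered-scheduling graph of a parity-check matrix.

    h is a list of M rows of 0/1 entries. Two row-nodes i < j are joined
    by an arc when some column has a non-zero entry in both rows.
    """
    edges = set()
    for column in zip(*h):
        rows = [i for i, v in enumerate(column) if v]
        edges.update(combinations(rows, 2))
    return len(h), edges

def cut_size(edges, assignment):
    """Count arcs whose endpoints are mapped to different NoC nodes."""
    return sum(1 for i, j in edges if assignment[i] != assignment[j])

if __name__ == "__main__":
    H = [[1, 1, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 1, 1]]
    m, edges = row_graph(H)
    print(m, sorted(edges))
    print(cut_size(edges, [0, 0, 1]))  # rows 0,1 on node 0, row 2 on node 1
```

A partitioner would search for the assignment that keeps the cut small while balancing the number of rows per node, which corresponds to few network messages and an even load.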
B. Analysis of NoCs for LDPC codes
In order to show the potential of the NoC approach in the design of LDPC code decoders, the whole set of WiMAX codes has been used as a design case. In Table I 
where $f_{clk}$ is the clock frequency, $It_{max}$ is the maximum number of iterations, $lat_{core}$ is the maximum latency of the decoding core and $n_{cycles}$ is the duration of the message passing phase. Results in Table I have been obtained with (12), imposing $f_{clk}$ = 300 MHz, $It_{max}$ = 10 and $lat_{core}$ = 15 cycles. The value of $n_{cycles}$ is measured by means of the simulator [26]. The area occupation is the post-synthesis result obtained with Synopsys Design Compiler on a 90 nm CMOS technology. These area results do not take into account the PEs and the incoming message memories.
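As a rough illustration of how these quantities combine, the following Python sketch assumes that one decoding iteration lasts lat_core + n_cycles clock cycles and that throughput is the number of bits delivered per frame divided by the time taken by It_max iterations. Both this functional form and the choice of counting information bits are assumptions of the sketch, not a restatement of (12); the value n_cycles = 400 is hypothetical.

```python
def throughput_bps(bits_per_frame, f_clk_hz, it_max, lat_core, n_cycles):
    """Assumed model: it_max iterations of (lat_core + n_cycles) cycles each."""
    cycles_per_frame = it_max * (lat_core + n_cycles)
    return bits_per_frame * f_clk_hz / cycles_per_frame

def max_n_cycles(bits_per_frame, f_clk_hz, it_max, lat_core, target_bps):
    """Largest message-passing duration that still meets target_bps."""
    return bits_per_frame * f_clk_hz / (it_max * target_bps) - lat_core

if __name__ == "__main__":
    # f_clk, It_max and lat_core as stated in the text; 1152 assumes the
    # information bits of an N = 2304, r = 0.5 code are what is counted.
    t = throughput_bps(1152, 300e6, 10, 15, n_cycles=400)
    print(f"{t / 1e6:.1f} Mb/s")
    print(max_n_cycles(1152, 300e6, 10, 15, target_bps=70e6))
```

Under this model the message passing phase dominates the iteration time, which is why the simulated n_cycles of each topology translates almost directly into throughput.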
Generalized Kautz topologies outperform all the other ones in terms of both throughput and complexity in the case of LDPC codes. Moreover, D = 3 solutions give higher throughputs than D = 2 ones, whereas they are comparable to D = 4 topologies but with lower area occupation.
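For readers unfamiliar with the topology, a generalized Kautz network can be generated and inspected with a short Python sketch. It assumes the Imase-Itoh construction, in which node i has D outgoing links towards nodes (-D·i - j) mod P for j = 1, ..., D; whether the simulator of [17] uses exactly this construction is an assumption, and the function names are illustrative.

```python
from collections import deque

def generalized_kautz(p, d):
    """Adjacency lists of a P-node, degree-D generalized Kautz digraph
    (Imase-Itoh form, assumed): i -> (-d*i - j) mod p, j = 1..d."""
    return {i: [(-d * i - j) % p for j in range(1, d + 1)] for i in range(p)}

def hop_counts(adj, src):
    """Breadth-first search: minimum number of hops from src to each node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Longest shortest path over all reachable source/destination pairs."""
    return max(max(hop_counts(adj, s).values()) for s in adj)

if __name__ == "__main__":
    noc = generalized_kautz(22, 3)
    print(noc[0])          # outgoing links of node 0
    print(diameter(noc))   # worst-case hop count between two nodes
```

The same breadth-first search is the kind of off-line shortest-path computation on which table-based routing policies rely.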
C. Analysis of NoCs for turbo and LDPC codes
To find the most suitable NoC for both turbo and LDPC decoding according to the WiMAX standard, we analyzed the results shown in [17] for turbo codes and the ones presented in Table I for LDPC codes. Generalized Kautz topologies show the best average throughput-to-area ratio for both turbo and LDPC codes, and D = 3 is a good throughput/complexity trade-off. For LDPC codes the minimum value of P to achieve the 70 Mbit/s throughput required by the IEEE 802.16e standard, with a 300 MHz clock frequency, is 22. On the contrary, turbo codes yield a higher than required throughput with a 22-node NoC. Thus, the working frequency for turbo codes can be lowered to 75 MHz while the throughput remains above the required limit. For both codes throughput and area show a weak dependence on the routing algorithm. However, the SSP-FL routing algorithm guarantees the best average performance also with different topologies and non-WiMAX codes, thus being the best choice in terms of flexibility. A selection of the obtained results is given in Table II for the N = 2400 turbo code and the N = 2304, r = 0.5 LDPC code.
IV. DESIGN OF PROCESSING ELEMENT ARCHITECTURES
Each PE includes two distinct decoding cores.
A. LDPC decoding core
For LDPC decoding, the PE must be structured so that all the block lengths and code rates imposed by the standard are supported. A simple and effective architecture based on sequential processing has been designed. Fig. 2 shows the architecture of the LDPC decoding core.
B. Turbo decoding core
The proposed solution for the turbo decoding core is the SISO architecture in Fig. 3. Since the turbo code used in the WiMAX standard is double-binary, each message $\lambda_{i,j}$ is a vector of three elements. According to [23], sending bit-level (BL) instead of symbol-level extrinsic information reduces the NoC complexity by roughly 1/3. Resorting to the solution proposed in [24], this complexity reduction comes at the expense of a 0.2 dB loss in BER performance. A dedicated unit, the Bit-To-Symbol Conversion Unit (BTS CU), converts the incoming a priori values from bit level (BL) to symbol level.
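The saving of roughly 1/3 comes from sending two bit-level values per double-binary symbol instead of three symbol-level ones. A minimal Python sketch of one possible pair of conversions is given below; it assumes LLRs measured relative to the reference symbol 00, independence of the two bits when rebuilding symbol values, and a max-log rule in the opposite direction. These rules and the function names are assumptions of the sketch and are not claimed to be the exact method of [24].

```python
def bit_to_symbol(llr_a, llr_b):
    """Rebuild the three symbol-level values (for symbols 01, 10, 11)
    from the two bit-level LLRs of a double-binary symbol (A, B)."""
    return (llr_b, llr_a, llr_a + llr_b)

def symbol_to_bit(l01, l10, l11):
    """Max-log compression of three symbol-level values into two bit LLRs.
    The reference symbol 00 has value 0 by definition."""
    llr_a = max(l10, l11) - max(0.0, l01)
    llr_b = max(l01, l11) - max(0.0, l10)
    return (llr_a, llr_b)

if __name__ == "__main__":
    # Bit -> symbol -> bit is lossless under these rules...
    print(symbol_to_bit(*bit_to_symbol(1.0, -2.0)))
    # ...but symbol -> bit -> symbol is not, which is where the
    # performance loss of bit-level exchange originates.
    print(bit_to_symbol(*symbol_to_bit(1.0, 1.0, 0.0)))
```

The second call shows the information discarded by the compression: a symbol vector in which 01 and 10 are both likelier than 11 cannot be represented by two independent bit LLRs.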
V. RESULTS
As a case study, a complete turbo/LDPC decoder for the WiMAX standard has been implemented. Table III shows the pre-layout synthesis results (2nd row), obtained with Synopsys Design Compiler on a 90 nm deep sub-micron CMOS technology [25], together with recent state-of-the-art dual code decoders. Where possible, worst-case throughput and the relative code are reported. For the sake of fairness it is worth noting that [5] and [9] support both WiMAX and LTE modes. In this work the SISO/LDPC core shared memories account for 61.8% of the processing core area: SISO-exclusive logic contributes 18.6%, while LDPC core-exclusive logic occupies the remaining 19.6%. Comparison with [9] shows similar core area occupation, whereas our NoC contributes 0.61 mm², about 20% of the total area occupation. Its larger complexity is mainly due to the more distributed topology and the more complex node architecture. Both LDPC and turbo cases in this work show compliance with the WiMAX standard throughput requirements. LDPC codes are considered with R = 0.5, and a clock frequency of 300 MHz for both the NoC and the LDPC core. The worst-case values, obtained for the N = 2304, r = 0.5 code, are still above 70 Mb/s: in [9], according to the provided formula, the throughput for the same code is below the standard threshold. The best working conditions for turbo decoding are with R = 0.33. Since the designed SISO architecture produces two $\lambda_k[u]$ every three clock cycles, it must run at half the clock frequency of the NoC. This characteristic, together with the lower memory access rate of turbo decoding, results in a large power reduction w.r.t. LDPC decoding. It is worth noting that, in the turbo decoding mode, the proposed architecture achieves the lowest power consumption as compared with [5]-[9].
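The relation between the SISO output rate and R = 0.33 can be checked with a one-line calculation, assuming that each $\lambda_k[u]$ produced by the SISO is injected into the network as one message and that R is measured per NoC clock cycle. The symbols $f_{SISO}$ and $f_{NoC}$ are names introduced here for illustration.

```latex
R_{turbo} = \frac{2\ \text{messages}}{3\ \text{SISO cycles}} \cdot \frac{f_{SISO}}{f_{NoC}}
          = \frac{2}{3} \cdot \frac{1}{2}
          = \frac{1}{3} \approx 0.33
```

Running the SISO at half the NoC frequency is therefore what matches its native rate of 2/3 messages per SISO cycle to the selected network operating point.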
The architecture in [5] not only supports both WiMAX and LTE modes but also features a very small area occupation. However, it does not reach a high enough throughput for the WiMAX standard, while our decoder has both smaller area and higher throughput than [7]. Area occupation is smaller than [6], but throughput analysis is difficult, since standard compliance is stated but no minimum values are reported. The architecture for WiMAX/WiFi LDPC codes and the 3GPP-LTE turbo code presented in [8] runs at 500 MHz and achieves the highest throughput among the compared architectures with the same complexity as our architecture. A fair comparison is not possible, as the WiMAX turbo code is not addressed. The proposed decoder guarantees compliance with WiMAX, but is not limited to its codes: the SISO can work with any 8-state Double-Binary Turbo Code (DBTC), whereas the LDPC core can sustain any code smaller than the 802.16e ones (e.g. WiFi).
VI. CONCLUSIONS
The design of a fully flexible NoC-based turbo/LDPC decoder is presented, together with custom simulation software models for a thorough analysis of the NoC architecture. The proposed decoder implementation offers a high degree of flexibility and full compliance with the WiMAX standard, guaranteeing the highest worst-case throughput among the compared solutions and a small area occupation, together with particularly low power consumption in turbo mode.
