Abstract-In this work, novel results concerning Network-on-Chip-based turbo decoder architectures are presented. Stemming from previous publications, this work concentrates first on improving the throughput by exploiting adaptive-bandwidth-reduction techniques. This technique shows, in the best case, an improvement of more than 60 Mb/s. Moreover, it is known that double-binary turbo decoders require a larger area than binary ones. This characteristic has the negative effect of increasing the data width of the network nodes. Thus, the second contribution of this work is to reduce the network complexity required to support double-binary codes by exploiting bit-level and pseudo-floating-point representations of the extrinsic information. These two techniques allow for an area reduction of more than 40% in the best case, with a performance degradation of about 0.2 dB.
I. INTRODUCTION
Today, modern telecommunications are a pervasive experience of data exchange among users and devices. One critical aspect of this scenario is the continuous demand for higher data rates, a problem that is exacerbated by the need for reliable transmission of data. To that purpose, the push towards the so-called beyond-3G technologies, such as WiMAX [1] and 3GPP-LTE [2], is a possible answer, where reliability is obtained by exploiting effective error correcting codes, such as turbo [3] and LDPC [4] codes. Unfortunately, the decoding algorithms for these codes are iterative, making high throughput implementations a challenging task [5], [6].
As shown in Table I in [7], several modern communication standards use turbo codes as a reliable channel coding scheme. However, since these codes have limited similarities, flexible architectures able to support different standards are interesting solutions to achieve interoperability [8]. This direction has been investigated in several works [9]-[15], where not only flexibility but also high throughput, achieved by means of parallel architectures, is addressed. As an example, [11], [12], [15] deal with optimized ASIC architectures where the flexibility is limited to two standards: UMTS/WiMAX, 3GPP-LTE/WiMAX and 3GPP-LTE/HSDPA, respectively. On the other hand, [9], [10], [13], [14] are based on the ASIP approach, where optimized processor-like architectures are used. It is worth observing that ASIP-based solutions allow for greater flexibility than ASIC-based architectures, as they can support several different codes and standards. Moreover, as suggested in [13], ASIP solutions are well suited to implement high throughput multiprocessor turbo decoder architectures [7].
The authors are with the Dipartimento di Elettronica, Politecnico di Torino, Italy.
Recently, in [16] we introduced the concept of intra-IP Network-on-Chip (NoC), where the well-known NoC paradigm is applied to the communication structure of processing elements that belong to the same IP. As discussed in several works, such as [7], [17]-[20], intra-IP NoC is a flexible solution to enable multi-ASIP turbo decoder architectures. However, as shown in detail in [7], [20], flexibility comes at the expense of increased complexity of the decoder architecture. In this work we improve the complexity/performance trade-off of NoC-based turbo decoder architectures by reducing the traffic load on the network, as suggested in [21]. The adopted traffic reduction technique offers, in the best case, a throughput improvement of more than 60 Mb/s and 40 Mb/s for binary and double-binary codes, respectively. Furthermore, we exploit two known techniques [22], [23], originally proposed to limit the amount of memory in turbo decoder architectures, as possible solutions to reduce the complexity of the NoC when double-binary turbo codes [24] are employed, as in the WiMAX standard.
The paper is structured as follows: in section II we recall the equations required to implement the decoding algorithm, whereas in section III we describe the peculiar characteristics of an NoC-based turbo decoder architecture, including the architecture of routing elements, low-complexity routing algorithms and topologies. Section IV describes the experimental setup we defined to increase the throughput and reduce the area of NoC-based turbo decoder architectures for both binary and double-binary codes. To this purpose we considered the HSDPA and the 3GPP-LTE standards for binary codes, and the WiMAX standard for double-binary codes. Finally, in section V conclusions are drawn.
II. DECODING ALGORITHMS
Since turbo codes are based on the concatenation (usually parallel) of two constituent Convolutional Codes (CC) (Fig. 1 (a)), the decoder is made of two constituent decoders that exchange their data by means of an interleaver (Π) and a deinterleaver ($\Pi^{-1}$), see Fig. 1 (b). For the sake of brevity, in the next paragraph we define the symbols used in Fig. 1 (a) and (b) without specifying whether they are related to CC1 or CC2.
The decoding algorithm of turbo codes is an iterative process made of two half iterations, one for each constituent decoder. Each half iteration is based on Maximum-A-Posteriori (MAP) estimation achieved by means of the BCJR algorithm [25], where the Log-Likelihood-Ratio (LLR) representation is usually adopted [26]. Based on the trellis notation shown in Fig. 1 (c), and denoting by U the set of uncoded symbols, each constituent MAP decoder, often referred to as a Soft-In-Soft-Out (SISO) module, computes the extrinsic information according to (1), where $\tilde{u} \in U$ is an uncoded symbol taken as a reference (usually $\tilde{u} = 0$), $u \in U \setminus \{\tilde{u}\}$, k is a trellis step, e is a transition in a trellis step and u(e) is the corresponding uncoded symbol. Thus, the two λ terms in (1) are the extrinsic and a-priori information, respectively, for symbol u at trellis step k, expressed as LLRs. The $\max^*\{x_i\}$ function is implemented as $\max\{x_i\}$ followed by a correction term often stored in a small Look-Up-Table (LUT) [27], [28]. The correction term, usually adopted when decoding binary codes (Log-MAP), can be omitted for double-binary turbo codes with minor error rate performance degradation (Max-Log-MAP).
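The relation between the exact operator, its Max-Log-MAP approximation and a LUT-based correction can be illustrated with a minimal Python sketch. It relies on the well-known identity $\max^*(a, b) = \max(a, b) + \ln(1 + e^{-|a-b|})$; the LUT step and the number of entries are illustrative assumptions and are not taken from [27], [28].

```python
import math
from functools import reduce


def max_star(a, b):
    """Exact max*: log(exp(a) + exp(b)), written as max plus a correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_all(values):
    """max* over a set of values, obtained by applying the two-input operator repeatedly."""
    return reduce(max_star, values)


def max_log(a, b):
    """Max-Log-MAP approximation: the correction term is dropped."""
    return max(a, b)


def make_lut_max_star(step=0.5, entries=8):
    """Build a LUT-based max*. The step and table size are hypothetical choices."""
    lut = [math.log1p(math.exp(-k * step)) for k in range(entries)]

    def lut_max_star(a, b):
        idx = int(abs(a - b) / step)
        # Beyond the table range the correction is treated as zero.
        corr = lut[idx] if idx < entries else 0.0
        return max(a, b) + corr

    return lut_max_star
```

The correction term is at most ln 2 (when the two inputs coincide) and decays quickly with their distance, which is why a small table is usually considered sufficient.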
The term $b^{ext}(e)$ in (1) is defined as:
In a parallel decoder, P SISOs operate concurrently on disjoint portions of the trellis. Denoting by N the number of trellis steps processed by each constituent decoder, each SISO operates on a trellis slice made of N/P steps. As a consequence, we can extend the notation introduced in the previous paragraph to a parallel decoder, where $\lambda^{ext}_{i,j}[u]$ is the extrinsic information produced by SISO i at the j-th trellis step. For further details on the decoding algorithm the reader can refer to [5].
III. NOC-BASED TURBO DECODER ARCHITECTURES
An NoC-based turbo decoder architecture can be represented as a graph where each node is made of a Routing Element (RE) and a Processing Element (PE) (see Fig. 2). Each PE, devoted to performing the processing required by the BCJR algorithm, contains a SISO processor and two memories where intrinsic and a-priori information are stored, respectively. On the other hand, each RE has a simple structure made of M input buffers (FIFOs), an M × M crossbar switch and M output registers. REs are devoted to routing the data produced by PEs to the correct destination node according to Π and $\Pi^{-1}$. To this purpose we introduce d(i, j) as the destination node of $\lambda^{ext}_{i,j}[u]$. In order to complete a half iteration, $\lambda^{ext}_{i,j}[u]$ is stored at the location t(i, j) in the a-priori information memory of node d(i, j).
In general, PEs and REs can operate at different rates; thus, to decouple the design of PEs and REs, we define R as the number of packets injected into the network in a clock cycle. As a consequence, R = 1 means that each PE injects one new packet into the network per clock cycle, whereas R = 0.5 means that a new packet is injected every two clock cycles. It is worth noting that the case R = 1 corresponds to REs and PEs working at the same clock frequency (isochronous), with PEs able to output a new packet of extrinsic information at each clock cycle. On the contrary, R < 1 models either an isochronous system where PEs output fewer than one packet per clock cycle or a mesochronous system where REs work at a higher clock frequency than PEs.
A. RE architectures
In [20] three possible architectures for REs (see Fig. 2), referred to as Fully-Adaptive (FA), All Precalculated (AP) and Partially Precalculated (PP) architectures, were presented.
The FA architecture (Fig. 2 (a)) sends on the network packets of data made of a header, containing d(i, j), and a payload, containing $\lambda^{ext}_{i,j}[u]$ and t(i, j). The data are routed by means of a Routing Algorithm (RA).
The AP architecture (Fig. 2 (b)) is obtained by observing that, given Π and $\Pi^{-1}$, the values of d(i, j) and t(i, j) are fixed and can be expressed through Θ(·) and ⌊·⌋, where ⌊·⌋ denotes the floor operation and Θ(·) can be either Π(·) or $\Pi^{-1}$(·) depending on the current half iteration. As a consequence, for each node we can precalculate and store in a Routing Memory (RM) and in a Location Memory the routing information and t(i, j), the location where the received value $\lambda^{ext}_{i,j}[u]$ will be stored, respectively. Thus, with the AP architecture we reduce the width of the data bus at the expense of some extra memory.
The PP architecture (Fig. 2 (c)) only precalculates the t(i, j) sequences; thus, it requires a narrower data bus than the FA architecture and less memory than the AP one.
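As an illustration of how the destination and location sequences could be precalculated, the following Python sketch computes d(i, j) and t(i, j) for a toy interleaver. It assumes that P divides N, that SISO i processes the contiguous trellis steps i·N/P, ..., (i+1)·N/P − 1, and that the permuted index is split into a node index (integer division) and a memory location (remainder). The function name `precalculate` and the toy permutation are hypothetical and only serve the example.

```python
def precalculate(theta, P):
    """Return (d, t) for a permutation theta, where theta[k] is the permuted position of k.

    d[i][j] is the destination node and t[i][j] the memory location of the j-th
    value produced by SISO i, under the contiguous-slice assumption stated above.
    """
    N = len(theta)
    assert N % P == 0, "this sketch assumes that P divides N"
    slice_len = N // P
    d = [[theta[i * slice_len + j] // slice_len for j in range(slice_len)]
         for i in range(P)]
    t = [[theta[i * slice_len + j] % slice_len for j in range(slice_len)]
         for i in range(P)]
    return d, t


# Toy interleaver on N = 8 positions (not taken from any standard).
toy_theta = [(3 * k) % 8 for k in range(8)]
d, t = precalculate(toy_theta, P=2)
```

In an AP node both tables would be stored, whereas a PP node would store only `t` and leave the destination in the packet header.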
To improve the throughput/area figures of NoC-based turbo decoder architecture we infer from [20] two main results:
• The AP architecture can be conveniently used with complex routing algorithms to concurrently maximize the throughput and minimize the area. Unfortunately, as pointed out in [7], this comes at the expense of a significant amount of external memory to store the routing information; as an example, about 64 MB of memory are required to support all the interleavers specified by the HSDPA standard [29].
• As long as the network is faster than the PEs (R < 1), throughput and area figures tend to be independent of the routing algorithm. Thus, both FA and PP architectures with simple RAs should be further investigated. In particular, the performance of the FA architecture can be improved by using Adaptive Bandwidth Reduction (ABR) techniques such as the one proposed in [21], namely by avoiding the exchange of unnecessary extrinsic information values. This distinguishing feature of the FA architecture, which is not available with AP and PP architectures, is detailed in section IV-A. On the contrary, the PP architecture features a narrower data bus than the FA one; however, it requires some external memory to store the configurations of all the Location Memories. Moreover, in several standards, such as HSDPA, 3GPP-LTE and WiMAX, the d(i, j) and t(i, j) sequences can be generated algorithmically with simple architectures [12], [30], [31]. As a consequence, the FA architecture can also take advantage of this feature to reduce the complexity of the whole decoder.
B. Low complexity RAs
In order to increase the throughput and reduce the area of the decoder, RAs should be based on simple, deadlock-free routing policies that can be implemented with little logic and completed in one clock cycle. As suggested in [20], Round-Robin (RR) and FIFO-length (FL) are suitable policies for NoC-based turbo decoders. RR is based on a circular serving policy, whereas with FL policies each input is served considering the number of elements stored in its input buffer: FL sorts the input buffers according to the number of stored elements, then it serves them in decreasing order. Routing paths are stored in a routing table: for each pair of nodes in the network, one shortest path is stored in the routing table. This approach, where only one shortest path is considered, will be referred to as Single-Shortest-Path (SSP) [20] in the rest of the paper.
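The two serving policies can be sketched in a few lines of Python. This is only a behavioural illustration: the function names, the tie-breaking rule in FL (lowest input index first) and the one-grant-per-output arbitration step are assumptions made for the example, not details taken from [20].

```python
def round_robin_order(num_inputs, last_served):
    """Inputs in circular order, starting right after the one served last."""
    return [(last_served + 1 + k) % num_inputs for k in range(num_inputs)]


def fifo_length_order(fifo_lengths):
    """Inputs sorted by decreasing FIFO occupancy (ties: lowest index first, an assumption)."""
    return sorted(range(len(fifo_lengths)), key=lambda i: (-fifo_lengths[i], i))


def arbitrate(requests, order):
    """Grant each output port to the first input, in serving order, that requests it.

    requests[i] is the output port requested by the head packet of input i,
    or None when the input FIFO is empty. Returns {output_port: input_index}.
    """
    granted = {}
    for i in order:
        out = requests[i]
        if out is not None and out not in granted:
            granted[out] = i
    return granted


# Three inputs: inputs 0 and 1 compete for output 1, input 2 asks for output 0.
requests = [1, 1, 0]
rr_grants = arbitrate(requests, round_robin_order(3, last_served=0))
fl_grants = arbitrate(requests, fifo_length_order([2, 5, 5]))
```

In both cases the losing input keeps its packet in the FIFO and competes again in the next cycle; the two policies differ only in how the serving order is built.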
C. NoC topologies
In [20] several fixed-degree topologies for NoC-based turbo decoder architectures are considered. However, since Π and $\Pi^{-1}$ tend to spread $\lambda^{ext}_{i,j}[u]$ almost uniformly, the traffic pattern on the network is almost uniform too. Experimental results in [20] show that topologies with logarithmic diameter, such as generalized De-Bruijn [32] and generalized Kautz [33], achieve higher throughput and require lower area than other well-known fixed-degree topologies such as ring, honeycomb and toroidal-mesh ones.
IV. EXPERIMENTAL SETUP
Since in this work we aim at increasing the throughput and reducing the area of NoC-based turbo decoder architectures, we focus on the most significant cases discussed in section III, namely the FA node architecture with SSP-RR and SSP-FL routing algorithms. Moreover, we consider only generalized Kautz topologies, as they have logarithmic diameter and fewer self-loops than generalized De-Bruijn ones [20], [32], [33] (if we model a topology as a graph, a self-loop is an edge whose source and destination nodes coincide). The degree of the network D = M − 1 ranges in {2, 3, 4} and the parameter R varies in {0.33, 0.5, 1}. We simulated both HSDPA and 3GPP-LTE interleavers for the case of binary turbo codes. Furthermore, we simulated the double-binary turbo code used in the WiMAX standard as well.
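The two properties mentioned above (logarithmic diameter and number of self-loops) can be checked numerically with a short Python sketch. It assumes the commonly used Imase-Itoh definitions, where node u is connected to (d·u + r) mod n for r = 0, ..., d − 1 in the generalized De-Bruijn graph and to (−d·u − r) mod n for r = 1, ..., d in the generalized Kautz graph; the exact constructions used in [32], [33] may differ in detail.

```python
from collections import deque


def generalized_de_bruijn(n, d):
    """Successor lists of the generalized De-Bruijn digraph (assumed definition)."""
    return {u: [(d * u + r) % n for r in range(d)] for u in range(n)}


def generalized_kautz(n, d):
    """Successor lists of the generalized Kautz digraph (assumed definition)."""
    return {u: [(-d * u - r) % n for r in range(1, d + 1)] for u in range(n)}


def self_loops(graph):
    """Number of edges whose source and destination coincide."""
    return sum(1 for u, succ in graph.items() for v in succ if v == u)


def diameter(graph):
    """Longest shortest path, computed with one breadth-first search per node."""
    worst = 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(graph):
            return float("inf")  # some node is unreachable
        worst = max(worst, max(dist.values()))
    return worst


# P = 64 nodes and degree D = 3, one of the configurations considered in the text.
kautz = generalized_kautz(64, 3)
de_bruijn = generalized_de_bruijn(64, 3)
loops = (self_loops(kautz), self_loops(de_bruijn))
```

Under these definitions the number of self-loops depends on both n and d, so the advantage of the Kautz construction varies with the configuration.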
In the following the throughput is computed as
where $N_b$ is the number of decoded bits, $f_{clk}$ is the clock frequency, I is the number of iterations, N
A. ABR in NoC-based turbo decoder architectures
According to [21], the throughput of an NoC-based turbo decoder can be increased by reducing the amount of data injected into the network. This approach is similar to well-known early stopping criteria that are routinely used to both increase the throughput and reduce the power consumption in turbo decoder architectures [35]. However, most related works focus on frame-level early stopping criteria. On the contrary, bit-level/symbol-level early stopping criteria [36] take into account that the reliability of each bit/symbol in a frame converges at a different speed. As a consequence, when the extrinsic information of a certain bit/symbol meets a proper reliability criterion, it is not necessary to further refine it. From an NoC-based turbo decoder perspective, this means that reliable $\lambda^{ext}_{i,j}[u]$ values are no longer sent over the network.
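From the network point of view, ABR acts as a filter placed between each PE and its RE. The following Python sketch shows this idea with a pluggable reliability test. The packet layout (destination, location, LLR), the function names and, above all, the magnitude-based criterion used in the example are hypothetical placeholders: they are not the criterion of [21] and are only meant to show how the injected traffic shrinks.

```python
def abr_filter(packets, is_reliable):
    """Keep only the packets whose extrinsic value is not yet considered reliable.

    packets is a list of (destination, location, llr) tuples.
    """
    return [p for p in packets if not is_reliable(p[2])]


def traffic_reduction(packets, is_reliable):
    """Fraction of packets that are not injected into the network."""
    if not packets:
        return 0.0
    sent = abr_filter(packets, is_reliable)
    return 1.0 - len(sent) / len(packets)


# Hypothetical criterion: a value is treated as reliable when its magnitude is large.
def example_criterion(llr, threshold=10):
    return abs(llr) > threshold


packets = [(0, 0, 3), (1, 2, -40), (2, 1, 12)]
to_send = abr_filter(packets, example_criterion)
saved = traffic_reduction(packets, example_criterion)
```

Because the destination travels in the packet header, this kind of filtering fits the FA node architecture, where a missing packet does not disturb a precalculated routing sequence.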
B. HSDPA and 3GPP-LTE case study
For binary turbo codes, such as the ones employed in the HSDPA and 3GPP-LTE standards, a simple ABR technique is obtained by fixing a threshold K that is compared with a quantity δ, given by the magnitude of an expression in the extrinsic information: when the comparison marks the value as reliable, $\lambda^{ext}_{i,j}[u]$ is not sent. The choice of K depends not only on the specific code considered but also on the quantization parameters used to represent $\lambda^{ext}_{i,j}[u]$ and on the performance loss in terms of Bit-Error-Rate (BER) that can be accepted. In the following we consider N = 5114 for HSDPA and N = 6144 for 3GPP-LTE. In both cases the extrinsic information is represented on eight bits, whereas the intrinsic information is represented on six bits with three fractional bits. Both decoders perform eight iterations (I = 8) with P = 64 using the Log-MAP algorithm [27] with a LUT-stored correction term.
In Figs. 3 and 4 we show the BER performance for the HSDPA and 3GPP-LTE codes, respectively, obtained by applying the ABR technique described in the previous paragraph with several values of K. In particular, Fig. 3 shows for the HSDPA code that when K > 10 the performance worsens significantly. As an example, with K = 10 there is a performance loss of less than 0.1 dB in the waterfall region and nearly ideal performance when the code floors. On the other hand, with K = 16 the performance loss is about 0.2 dB in the waterfall region and the code floor is shifted to higher SNR values by about 0.2 dB as well. Similar results were observed for the 3GPP-LTE code, so, for the sake of clarity, in Fig. 4 only results obtained with K = 4, 6, 8, 10 are shown.
For both cases we obtained the corresponding best and average bandwidth reduction at different SNR values through Monte Carlo simulations. Experimental results show that the throughput increase is significant when there is a high load on the network (R = 1), using either the FL or the RR routing algorithm. In particular, in Figs. 5, 6 and 7 we show the average throughput increase for the HSDPA turbo decoder for D = 2, 3, 4, respectively, with different values of K.
As can be observed, when R = 1 there is an average throughput increase, with respect to a decoder where ABR is not applied, that ranges from about 5 to 20 Mb/s for the HSDPA turbo decoder. Furthermore, we observed that in the best case there is a throughput increase of at least 60 Mb/s. On the other hand, when R < 1 the average throughput improvement is at most 5 Mb/s. Similar results have been obtained for the LTE turbo decoder.
To complete the comparison, we show in Table I the throughput/area results for the HSDPA and LTE cases, where the results for the HSDPA case with the ASP-FT routing algorithm and the AP node architecture are taken from [7]. As can be observed, the significant throughput increase obtained with the ABR technique on the FA node architecture when R = 1 is paid for with an area overhead with respect to the AP node architecture. However, as pointed out in [7], the AP node architecture requires a large external memory to store the routing information. Moreover, the difference in terms of area between FA and AP node architectures shrinks when R < 1. In particular, as shown in Table I, when R = 0.33 with P = 8 and P = 16 the FA node architecture with the SSP-FL routing algorithm requires less area than the AP one.
C. WiMAX case study
Simulation results shown in this section have been obtained with N = 1920, as in [23]. Each component of the extrinsic information is represented on eight bits, whereas the intrinsic information is represented on six bits with two fractional bits. The decoder performs eight iterations (I = 8) with P = 64 using the Max-Log-MAP algorithm [27].
Since in binary turbo codes U = {0, 1}, the LLR of the extrinsic information is a scalar value. On the other hand, for double-binary turbo codes U = {00, 01, 10, 11}; as a consequence, $\lambda^{ext}_{i,j}[u]$ is an array containing three elements. In [23], a bit-level double-binary turbo decoder architecture is proposed to reduce the amount of memory needed to store the extrinsic information. The same idea is exploited in this work to reduce the area overhead of the NoC. Basically, a double-binary uncoded symbol u can be represented as a pair of binary random variables AB. Then, with a slight abuse of notation, given a binary random variable X, we denote X = 0 with $\bar{X}$ and X = 1 with X. Resorting to the Max-Log-MAP approximation, we can convert Symbol-Level (SL) LLRs to Bit-Level (BL) LLRs.
Similarly, we can convert BL LLRs back to SL LLRs by means of approximations.
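A minimal Python sketch of one common way to perform the two conversions is given below. It assumes that the three SL LLRs are referred to the symbol 00 (so that the LLR of 00 is zero), that the SL to BL step uses the Max-Log-MAP approximation, and that the BL to SL step treats the two bits as independent. The function names and the exact expressions are assumptions made for the example and may differ from those in [23].

```python
def sl_to_bl(l01, l10, l11):
    """Symbol-level LLRs (relative to symbol 00) to bit-level LLRs of A and B.

    Symbols are written as AB, so 10 means A = 1 and B = 0.
    """
    l00 = 0
    lam_a = max(l10, l11) - max(l00, l01)  # A = 1 versus A = 0
    lam_b = max(l01, l11) - max(l00, l10)  # B = 1 versus B = 0
    return lam_a, lam_b


def bl_to_sl(lam_a, lam_b):
    """Bit-level LLRs back to symbol-level LLRs (l01, l10, l11), assuming independent bits."""
    return lam_b, lam_a, lam_a + lam_b


# A symbol-level triple that is consistent with independent bits survives the round trip.
consistent = bl_to_sl(*sl_to_bl(2, 5, 7))
# A generic triple does not: the information carried by the third value is lost.
lossy = bl_to_sl(*sl_to_bl(-3, 4, -6))
```

The second example makes the source of the BER loss visible: two values cannot, in general, carry the same information as three.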
For further details on bit-to-symbol and symbol-to-bit conversion the reader can refer to [23]. The use of BL LLRs introduces a BER performance loss of about 0.2 dB (see Fig. 8), but it reduces the data width by one third with respect to SL LLRs, as the payload of each packet contains two BL LLRs instead of the three elements of $\lambda^{ext}_{i,j}[u]$. To further reduce the data width we applied to BL LLRs the Pseudo-Floating-Point (PFP) representation suggested in [22] (see also [37], [38]). According to [22], each BL LLR λ is represented by means of a mantissa ξ and a shift σ.
Denoting by $n_\lambda$, $n_\xi$ and $n_\sigma$ the number of bits used to represent λ, ξ and σ, respectively, the mantissa is obtained from λ by means of the >> operator,
where >> stands for arithmetic right shift. As a consequence, the payload of each packet sent on the network now contains two mantissas and the shift $\sigma_{i,j}$, for a data width of $n_d = 2n_\xi + n_\sigma$ bits. If we impose $n_\xi = 4$, we obtain $\sigma_{i,j} \le 4$ and so $n_\sigma = 3$, leading to $n_d = 2n_\xi + n_\sigma = 11$, which is less than half the value of $n_d$ for $\lambda^{ext}_{i,j}[u]$ (three eight-bit elements, that is, 24 bits). As shown in Fig. 8, the BER performance loss of the BL, PFP LLR representation is nearly the same as that of the fixed-point BL one. In Table II the throughput and area results obtained by using the SL and the BL, PFP LLR representations are shown for generalized Kautz topologies. As can be observed, the area decrease as a function of $n_d$ is not linear; however, it becomes particularly interesting when R = 1. As an example, with R = 1, D = 4 and P = 64 there is an area saving of up to 40%.
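The bit budget discussed above can be reproduced with a short Python sketch. It assumes eight-bit two's-complement BL LLRs, four-bit mantissas and a single shift shared by the two LLRs of a packet, chosen as the smallest shift for which both mantissas fit. This selection rule and the function names are assumptions made for the example; the rule actually imposed in [22] may differ.

```python
N_LAMBDA = 8  # bits per bit-level LLR (value used in the text)
N_XI = 4      # bits per mantissa (value used in the text)


def fits(value, bits):
    """True when value is representable as a signed two's-complement number on `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def pfp_encode(lam_a, lam_b, n_xi=N_XI, n_lambda=N_LAMBDA):
    """Return (xi_a, xi_b, sigma) using one shared shift (assumed selection rule)."""
    assert fits(lam_a, n_lambda) and fits(lam_b, n_lambda)
    sigma = 0
    # Python's >> on integers is an arithmetic shift, also for negative values.
    while not (fits(lam_a >> sigma, n_xi) and fits(lam_b >> sigma, n_xi)):
        sigma += 1
    return lam_a >> sigma, lam_b >> sigma, sigma


def pfp_decode(xi_a, xi_b, sigma):
    """Approximate reconstruction of the two LLRs at the receiving node."""
    return xi_a << sigma, xi_b << sigma


# Largest shift over every pair of eight-bit values, and the resulting payload width.
max_sigma = max(pfp_encode(a, b)[2] for a in (-128, -1, 0, 20, 127) for b in (-128, 3, 127))
n_sigma = max_sigma.bit_length()
n_d = 2 * N_XI + n_sigma
```

Small values are transmitted exactly (shift equal to zero), whereas large values lose their least significant bits, which is consistent with the limited BER impact reported in the text.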
The techniques described in the previous paragraphs are all aimed at reducing the area of the NoC-based turbo decoder. Furthermore, the ABR technique described in section IV-A can be used to improve the throughput as well. In order to limit the BER performance loss introduced by the ABR technique, we employ the SL reliability criterion proposed in [21], but we send BL, PFP extrinsic information when the criterion is not met. The ABR technique we used is summarized in Algorithm 1. As shown in Fig. 8, the BER performance loss introduced by the ABR technique is negligible. Moreover, as shown in Figs. 9, 10 and 11, when R = 1 the ABR technique induces an average throughput increase of about 5 to 20 Mb/s. Similarly to the binary codes, in the best case the throughput improvement is more than 40 Mb/s, whereas when R < 1 the average throughput improvement is at most 5 Mb/s.
V. CONCLUSIONS
In this work ABR techniques have been exploited to improve the throughput of NoC-based turbo decoder architectures. When the load of the network is high, the average throughput is improved by about 5 to 20 Mb/s, and in the best case the throughput is increased by more than 60 Mb/s and 40 Mb/s for binary and double-binary codes, respectively. Moreover, the area required to support double-binary codes has been significantly reduced (by more than 40% in the best case) by applying the BL, PFP representation of the extrinsic information, with a BER performance loss of about 0.2 dB.
