Turbo NOC: a framework for the design of Network On Chip based turbo decoder architectures

Maurizio Martina, Member IEEE, Guido Masera, Senior Member IEEE

Abstract—This work proposes a general framework for the design and simulation of network on chip based turbo decoder architectures. Several parameters in the design space are investigated, namely the network topology, the parallelism degree, the rate at which messages are sent by processing nodes over the network and the routing strategy. The main results of this analysis are: i) the most suitable topologies to achieve high throughput with a limited complexity overhead are generalized de-Bruijn and generalized Kautz topologies; ii) depending on the throughput requirements different parallelism degrees, message injection rates and routing algorithms can be used to minimize the network area overhead.

Index Terms—Turbo Decoder, Network on Chip, VLSI

I. INTRODUCTION

In the last years wireless communication systems coped with the problem of delivering reliable information while granting high throughput. This problem has often been faced resorting to channel codes able to correct errors even at low signal to noise ratios. As pointed out in Table I in [1], several standards for wireless communications adopt binary or double binary turbo codes [2], [3] and exploit their excellent error correction capability. However, due to the high computational complexity required to decode turbo codes, optimized architectures (e.g. [4], [5], [6], [7], [8]) have been usually employed. Moreover, several works addressed the parallelization of turbo decoder architectures to achieve higher throughput [9], [10]. In particular, many works concentrate on avoiding, or reducing, the collision phenomenon that arises with parallel architectures (e.g. [11], [12], [13], [14]).

Although throughput and area have been the dominant metrics driving the optimization of turbo decoders, recently, the need for flexible systems able to support different operative modes, or even different standards, has changed the perspective. In particular, the so called software defined radio (SDR) paradigm made flexibility a fundamental property [15] of future receivers, which will be requested to support a wide range of heterogeneous standards. Some recent works (e.g. [1], [16], [17]) deal with the implementation of Application-Specific Instruction-set Processor (ASIP) architectures for turbo decoders. In order to obtain architectures that achieve both high throughput and flexibility multi-ASIP is an effective solution. Thus, together with flexible and high throughput processing elements, a multi-ASIP architecture must feature also a flexible and high throughput interconnection backbone. To that purpose, the Network-On-Chip (NOC) approach has been proposed to interconnect processing elements in turbo decoder architectures designed to support multiple standards [18], [19], [20], [21], [22], [23]. Current approaches to multi-standard turbo decoders pragmatically aim at implementing a “just enough flexibility”, by supporting interleavers and throughput requirements specified in some of the current standards. On the contrary, this paper addresses a “full flexibility”, capable of supporting high throughput implementation of any interleaver. In fact, even if parallel collision-free turbo decoder architectures exist, it is difficult to obtain a flexible architecture that is able to manage in a collision-free manner different turbo codes. In fact, some codes lead to parallel collision-free architectures only for specific values of the parallelism degree. As an example with WiMax turbo codes the collision-free feature can be obtained with proper selection of the decoder parallelism degree as a function of the data-frame size [24]. As a consequence, given a parallelism degree and a set of turbo codes, the NOC approach is a suitable solution to support the required flexibility. In addition, NOC based turbo decoder architectures have the intrinsic feature of adaptively reducing the communication bandwidth by the inhibition of unnecessary extrinsic information exchange. This can be obtained by exploiting bit-level reliability-based criteria where unnecessary iterations for reliable bits are avoided [25].

In [18], [19], [20] ring, chordal ring and random graph topologies are investigated whereas in [21] previous works are extended to mesh and toroidal topologies. Furthermore, in [22] butterfly and Benes topologies are studied, and in [23] binary de-Bruijn topologies are considered. However, none of these works presents a unified framework to design a NOC based turbo decoder, showing possible complexity/performance trade-offs. This work aims at filling this gap by providing a comprehensive study of NOC based turbo decoders, conducted by means of a dedicated NOC simulator and a list of obtained results, showing the complexity/performance trade-offs offered by different topologies, routing algorithms, node and ASIP architectures. Moreover, from previous studies it is not completely clear if the area overhead of NOC based turbo decoders is acceptable. This work tries to further investigate this aspect by giving real post synthesis results for a 130 nm standard cell technology.

The paper is structured as follows: in section II the requirements and characteristics of a parallel turbo decoder architecture are analyzed, whereas in section III NOC based approach is introduced. Section IV summarizes the topologies considered in previous works and introduces generalized de-Bruijn and generalized Kautz topologies as promising solutions for NOC based turbo decoder architectures. In section V three main routing algorithms are introduced, whereas in
section VI the Turbo NOC framework is described. Section VII describes the architecture of the different routing algorithms considered in this work, section VIII presents the experimental results and section IX draws some conclusions.

II. SYSTEM REQUIREMENT ANALYSIS

A parallel turbo decoder can be modeled as $P$ processing elements that need to read from and write to $P$ memories. Each processing element, often referred to as soft-in-soft-out (SISO) module, performs the BCJR algorithm [26], whereas the memories are used for exchanging the extrinsic information $\lambda$ among the SISOS. The decoding process is iterative and usually each SISO performs sequentially the BCJR algorithm for the two constituent codes used at the encoder side; for further details on the SISO module the reader can refer to [27]. As a consequence, each iteration is made of two half iterations referred to as interleaving and de-interleaving. During one half iteration the extrinsic information produced by SISO $i$ at time $j$ ($\lambda_{i,j}$) is sent to the memory $k$ at the location $t$, where $k = k(i, j)$ and $t = t(i, j)$ are functions of $i$ and $j$ derived from the permutation law (II or interleaver) employed at the encoder side. Thus, the time required to complete the decoding is directly related to the number of clock cycles necessary to complete a half iteration. Without loss of generality, we can express the number of cycles required to complete a half iteration ($hi$) as

$$N_{hi} = \frac{N}{P \cdot R} + IL$$

(1)

where $N$ is the total number of trellis steps in a data frame, $N/P$ is the number of trellis steps processed by each SISO, $R$ is the SISO output rate, namely the number of trellis steps processed by a SISO in a clock cycle, and $IL$ is the interconnection structure latency. Thus, the decoder throughput expressed as the number of encoded bits over the time required to complete the decoding process is

$$T = \frac{d \cdot N \cdot f_{clk}}{2I \cdot N_{hi}} = \frac{d \cdot N \cdot f_{clk}}{2I \cdot (\frac{N}{P \cdot R} + IL)}$$

(2)

where $f_{clk}$ is the clock frequency, $I$ is the number of iterations, $d = 1$ for binary codes and $d = 2$ for double binary codes. When the interconnection structure latency is negligible with respect to the number of cycles required by the SISO, we obtain

$$T \approx \frac{d \cdot P \cdot R}{2I} \cdot f_{clk}$$

(3)

Thus, to achieve a target throughput $\hat{T}$ and satisfactory error rate performance, a proper number $I$ of iterations should be used. The minimum $P$ ($P_m$) to satisfy $\hat{T}$ with $I$ iterations can be estimated from (3) for some ASIP architectures available in the literature. If we consider $I = 5$, as in [1], [17], $P$ ranges between 5 and 37 to achieve $\hat{T} = 200$ Mb/s (see Table I). It is worth pointing out that the $C = (R \cdot d)^{-1}$ values in Table I represent the average numbers of cycles required by the SISO to update the soft information of one bit (see Table VI in [1] and Table I in [17]). Moreover, $C$ strongly depends on the internal architecture of the SISO and in general tends to increase with the code complexity. As a consequence, several conditions can further increase $P$, namely 1) interconnection structures with larger $IL$; 2) higher $(R \cdot d)^{-1}$ values; 3) higher $\hat{T}$; 4) higher $I$; 5) lower clock frequency. Thus, we consider as relevant for investigation a slightly wider range for $P$: $P \in \{8, 16, 32, 64\}$.

### Table I

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Technology [nm]</th>
<th>$f_{clk}$ [MHz]</th>
<th>$C = (R \cdot d)^{-1}$</th>
<th>$P_m$</th>
</tr>
</thead>
<tbody>
<tr>
<td>[1]</td>
<td>65</td>
<td>400</td>
<td>2.35</td>
<td>6</td>
</tr>
<tr>
<td>[17]</td>
<td>90</td>
<td>400</td>
<td>1.75</td>
<td>5</td>
</tr>
<tr>
<td>[28]</td>
<td>90</td>
<td>335</td>
<td>6.5</td>
<td>20</td>
</tr>
<tr>
<td>[28]</td>
<td>180</td>
<td>180</td>
<td>6.5</td>
<td>37</td>
</tr>
</tbody>
</table>

III. NETWORK BASED APPROACH

The NOC approach [29] has been proposed as a general methodology to interconnect heterogeneous intellectual properties (IP) in complex systems on chip (inter-IP interconnection). Recent works deal with methodologies to design application specific NOCs (e.g. [30]) where the NOC is tailored around a particular application or group of applications. In this scenario, turbo decoder architectures are a common IP required in physical layer chips for modern communication standards. In this work, as in some previous papers, e.g. [18], [21], [23] we concentrate on the problem of interconnecting the main building blocks of a parallel turbo decoder, namely we focus on the intra-IP interconnection problem [31], and we do not deal with the general problem of connecting the turbo decoder IP to other receiver modules through an inter-IP interconnection network. To that purpose, it is worth pointing out that statistical characterization of communication patterns, which is one of the most relevant aspects in the design of application specific NOCs, is not required in turbo decoders, as communication patterns depend on $P$. As a consequence, given a set turbo codes with the corresponding $I$ laws, the intra-IP communication patterns are deterministic. Thus, the challenge of NOC based turbo decoder architectures is to find one or more sets of parameters that match throughput constraints for all supported standards with a reduced complexity overhead. This set of parameters includes $R$, $P$, the topology and the routing algorithm. Two interesting results, developed for a generic application in [32], can be used as general guidelines to design NOC based turbo decoder architectures. i) For short messages and maximum workload ($R = 1$) NOC is an effective solution compared to bus based implementations when $P$ grows. ii) NOC is still an effective solution compared to a bus based implementation even when $R < 1$, given that $P > 25$. Moreover, as highlighted in section II and in [17], to achieve a target throughput of several hundreds of Mb/s a realistic value for $P$ is of some tens. Thus, as shown in [21] interconnection structures that permit concurrent interleaving are mandatory for this scenario, making NOC a viable solution.

A NOC based turbo decoder architecture relies on $P$ nodes connected through a proper topology where the extrinsic information is sent over the network according to a certain
routing algorithm. We assume that each node has a certain number of input and output ports ($M$). As highlighted in [21], node architectures based on input-queuing lead to a fairly low complexity. As a consequence we assume that each node has a FIFO for each input, a crossbar to connect each input FIFO to a proper output and an output register, as shown in Fig. 1. Furthermore, each node has a local SISO (SISO $i$) that sends extrinsic information over the network through the $M - 1$ labeled input port and a local memory (MEM $i$) that receives extrinsic information from the network through the $M - 1$ labeled output port.

Three possible node architectures, shown in Fig. 1, have been considered in this work, namely the Fully Adaptive node architecture (FA) and two simplified versions, All Precalculated node architecture (AP) and Partially Precalculated node architecture (PP).

a) Fully adaptive node architecture: In each half iteration a SISO sends $N/P$ messages where every message is made of a payload containing the extrinsic information $\lambda_{i,j}$ and the location of the memory where the extrinsic information will be written ($t(i,j)$), and a header containing the identifier of the destination node ($k(i,j)$). As a consequence, the node should contain a memory to store $k(i,j)$ (Identifier Memory), a memory to store $t(i,j)$ (Location Memory) and a routing algorithm to properly route messages through the network (see Fig. 1 (a)).

b) All-precalculated node architecture: Since the permutation law defined by the interleaver is known a-priori, the path followed by a message during an interleaving (or de-interleaving) half iteration can be precalculated and stored as a routing information into a routing memory for each node. This approach reduces the data width of FIFOs, crossbars and registers as neither $k(i,j)$ nor $t(i,j)$ are sent over the network. The location where received messages ($\lambda'_{i,j}$) will be stored ($t'(i,j)$) can be also precalculated and stored into a Location Memory (see Fig. 1 (b)).

c) Partially-precalculated node architecture: Since the routing memory footprint can be relevant, a hybrid solution is obtained by precalculating and storing only $t'(i,j)$, whereas the routing is managed by a routing algorithm (see Fig. 1 (c)). This solution does not require a Routing Memory and employs a smaller payload with respect to the FA node architecture. On the other hand, the FA node architecture directly supports adaptive bandwidth reduction techniques, whereas, neither the AP nor the PP node architectures do. As an example the FA node architecture can be employed to implement the bit-level reliability-based criterion proposed in [25], where unnecessary iterations for reliable bits are avoided. In such a case, the extrinsic information related to reliable bits is not sent on the network leading to bandwidth reduction. As a consequence, this bandwidth reduction leads to increase the throughput of the NOC based turbo decoder.

IV. TOPOLOGIES

As highlighted in [21] and [23] the choice of topologies and routing algorithms impacts both on throughput and complexity. As a consequence, given a certain parallelism degree $P$, topologies with a small node degree ($D = M - 1$) as rings ($D = 2$) keep the network complexity overhead limited. On the other hand, topologies with a higher node degree as toroidal meshes ($D = 4$) can increase the throughput. Since interleavers tend to spread the extrinsic information almost uniformly among the $P$ memories, we consider fixed degree topologies where every node has the same degree $D$. Among fixed degree topologies we included rings and toroidal mesh networks as in [18] and [21]. Moreover, de-Bruijn topologies are known to show shorter worst case distance among nodes than toroidal meshes: therefore, they are good candidates to reduce the latency of the network in turbo decoder architectures, as investigated in [23]. A de-Bruijn topology is made of nodes labeled by an array of $n$ elements, each element is taken from an alphabet $\mathcal{A}$ with $m$ symbols. Each node is connected to the nodes whose labels are obtained by left-shifting the node-label array and by placing in the rightmost position a symbol from $\mathcal{A}$. As a consequence, each node is connected to $m$ nodes ($D = m$) and the number of nodes in the network is $P = m^n$. Thus, in general, de-Bruijn topologies for given $P$ and $D$ values not always exist. This limitation can be overcome by using generalized de-Bruijn topologies [33],
where there is a connection from node $i$ to node $j$ if

$$j = (i \cdot D + k) \mod P \quad k = 0, 1, \ldots, D - 1$$  \hspace{1cm} (4)

A further limitation of de-Bruijn and generalized de-Bruijn topologies are self loops that are present in some nodes, e.g. node 0 ("0000") in Fig. 2.

This limitation is overcome by Kautz topologies where nodes are labeled as in de-Bruijn topologies but avoiding sequences with equal symbols in consecutive positions of the node-label array (Kautz sequences). Then, node connections are obtained as for de-Bruijn topologies, where the symbol placed in the rightmost position of the node-label array is taken from $A$, subject to the constraint that the obtained node-label array is a Kautz sequence. As a consequence, each node is connected to $m - 1$ nodes ($D = m - 1$) and the number of nodes in the network is $P = m \cdot (m - 1)^{n-1}$. Thus, as for de-Bruijn topologies, Kautz topologies for assigned $P$ and $D$ values not always exist. This problem is eliminated by using generalized Kautz topologies [34], where there is a connection from node $i$ to node $j$ if

$$j = (-i \cdot D - k) \mod P \quad k = 1, \ldots, D$$  \hspace{1cm} (5)

Unfortunately, some generalized Kautz topologies have self loops; as an example with $P = 16$ and $D = 2$ node 10 ("1010") has a self loop (see Fig. 2). However, said $n_{sl}^P(P, D)$ the number of self loops introduced by a generalized de-Bruijn (B) or generalized Kautz (K) topology, we have that $n_{sl}^P(P, D) \leq n_{sl}^B(P, D)$ (see Fig. 2). As will be discussed in section V self loops can be managed in different ways depending on the routing algorithm.

Moreover, we included in our investigation honeycomb networks that, as suggested in [35], are alternatives to toroidal meshes that reduce nodes degree to $D = 3$. Thus, we have that rings, honeycombs and toroidal meshes are represented as undirected graphs, whereas de-Bruijn (generalized de-Bruijn) and Kautz (generalized Kautz) correspond to directed graphs. As an example in Fig. 2 we show the aforementioned topologies for $P = 16$.

V. ROUTING ALGORITHMS

Since in turbo decoder architectures the achieved throughput is a key objective, we should try to deliver messages following the shortest available path. Furthermore the NOC must guarantee that all messages are delivered to the destination, namely dropping of messages to avoid dead-locks is not allowed as it could impair the decoder correction capability. As highlighted in [23] shortest-path based routing algorithms are suited to achieve high throughput and grant message delivery. In the following we will consider both single-shortest-path (SSP) and all-local-shortest-path (ASP) based routing algorithms. In SSP algorithms only one shortest-path from each node $i$ to each node $k$ is considered. On the other hand, ASP algorithms rely on the fact that in a topology two nodes $i$ and $k$ may be connected by more shortest-paths. At each node $i$, the actual routing choice toward node $k$ must be made by selecting one destination node directly connected to $i$ and belonging to a set $N_{i,k}$ defined as the set of all nodes adjacent to $i$ and placed on a shortest path between $i$ and $k$. As an example, in the toroidal mesh topology in Fig. 2 there are six distinct shortest paths that connect node 0 ("0000") to node 6 ("0110"). Two of these paths go through node 4 ("0100"), two of them through node 1 ("0001") and two through node 3 ("0111"). For the connection between nodes 0 and 6, SSP algorithm selects a unique path and the choice of the first intermediate connection among nodes 1, 3 and 4 depends the graph visiting order. On the contrary, ASP algorithm explores all shortest paths between 0 and 6 and this leads to consider all three connections from node 0 to nodes 1, 3 and 4 ($N_{0,6} = \{1, 3, 4\}$), see bold lines in Fig. 2.

A further choice in the design of the NOC is how input FIFOs have to be served. Three distinct strategies have been considered in this work and indicated as SSP Round-Robin (SSP-RR), SSP FIFO-length (SSP-FL) and ASP FIFO-length with traffic-spreading (ASP-FT). These alternatives are briefly described in the following, while further details on routing algorithms for NOC can be found in [36]. The SSP-RR approach is based on a circular serving policy coupled to the ASP approach. FIFO-length strategies are weighted RR with weights equal to the number of elements stored into each input FIFO. The SSP-FL approach serves the input FIFOs based on the number of elements contained in each input FIFO: the longest FIFO is served first and the shortest one is served last. The ASP-FT approach is based on the input FIFO length serving policy, as for SSP-FL, but it is more complex and can be described as follows. Let’s define $T_{j}^{l,i}$ as the set of input ports in a node $l \in N_{i,k}$ that can receive a message from node $i$ at time $j$. At time $j$ the number of elements contained in the input FIFO associated to port $p \in T_{j}^{l,i}$ with $l \in N_{i,k}$ is $L_{j,p}^{l}$. According to Algorithm 1, the ASP-FT routing algorithm chooses $l \in N_{i,k}$ and $\hat{p} \in T_{j}^{l,i}$ so that

$$L_{j,p}^{l} = L_{\min} = \min_{p,l}(L_{j,p}^{l})$$  \hspace{1cm} (6)

The couples $\hat{l}, \hat{p}$ that satisfy (6) belong to the set $J_{j,p}^{l,i}$. To choose only one couple in $J_{j,p}^{l,i}$ we operate a traffic spreading based selection, namely our objective is to spread the traffic as much as possible over the network. To that purpose we use a set of counters ($Q$), where each counter $Q_{j,p}^{l,i}$ is incremented each time a message is sent from node $i$ to node $\hat{l}$ through input port $\hat{p}$. Then, we select the couple $\hat{l}, \hat{p} \in J_{j,p}^{l,i}$ that is
associated to the least used path

\[ Q^{i,l}_{j,\tilde{l},\tilde{p}} = Q_{\text{min}} = \min_{\tilde{p},\tilde{l}} \{ Q^{i,l}_{j,\tilde{l},\tilde{p}} \} \] (7)

It is worth pointing out that, shortest-path based routing algorithms do not prevent output ports contention, that is a situation where two or more inputs need to send data to the same output port. Said \( T_{n,j,b} \) the set of inputs in node \( n \) that at time \( j \) need to send data to output port \( b \), the contention problem can be faced by properly choosing an input \( a \in T_{n,j,b} \) allowed to send its data to output port \( b \). The remaining inputs belonging to \( T_{n,j,b} \) can be managed in different ways. In this work we consider the following two approaches: i) storing \( a' \in T_{n,j,b} - \{a\} \) into the corresponding input FIFO so that we delay a colliding message, in the following we will refer to this approach as delay-colliding-message (DCM); ii) if possible, sending \( a' \in T_{n,j,b} - \{a\} \) to another output port \( b' \neq b \), send-colliding-message (SCM). The DCM approach aims at reducing the number of hops to deliver a message to its destination, whereas the SCM approach aims at reducing the maximum depth of the input FIFOs. It is worth pointing out that topologies with self loops cause a waste of resources with DCM. In fact, since we are considering dead-lock free

**Algorithm 1 ASP-FT routing algorithm**

```
1: \( L_{\text{min}} \leftarrow \infty \)
2: \( Q_{\text{min}} \leftarrow \infty \)
3: for all \( l \in N^{i,k} \) do
4:   build \( T_{j,l}^{i} \)
5: for all \( p \in T_{j,l}^{i} \) do
6:   get \( L_{j,p}^{i} \)
7: if \( L_{j,p}^{i} \leq L_{\text{min}} \) then
8:   if \( Q_{j,p}^{i} \leq Q_{\text{min}} \) then
9:     \( Q_{\text{min}} \leftarrow Q_{j,p}^{i} \)
10:    \( L_{\text{min}} \leftarrow L_{j,p}^{i} \)
11:  \( \tilde{l} \leftarrow l \)
12:  \( \tilde{p} \leftarrow p \)
13: end if
14: end if
15: end for
16: end for
17: \( Q_{j,\tilde{l},\tilde{p}}^{i} = Q_{j,\tilde{l},\tilde{p}}^{i} + 1 \)
```
routing algorithms, DCM does not use a self loop as it is not on a shortest path. On the other hand, SCM can use a self loop, as a possible destination for a colliding message. However, by sending a message to a self loop we reduce the number of elements in one of the input FIFOs but we increase the number of elements in the input FIFO of the self loop.

VI. TURBO NOC SIMULATOR

The Turbo NOC simulator [37] is a cycle accurate, SystemC [38] based NOC simulator, specifically tailored for turbo decoder architectures. It estimates the throughput and complexity of a parallel NOC based turbo decoder architecture. It requires as inputs the following elements: the topology description in the form of an adjacency matrix, the permutation law used at the encoder and represented as a sequence of integer values, the routing algorithm (SSP-RR, SSP-FL, ASP-FT), the selected approach to handle contention (DCM/SCM), and the description of the key SISO characteristics. An example of the SISO characteristics is given in Fig. 3, where the upper part represents a frame of extrinsic informations \((\lambda_{i,j})\) that is sliced in two windows, each of which contains \(W\) values. The bottom part of Fig. 3 describes the timing of the SISO output: when a start signal is asserted, the SISO outputs the first extrinsic information after a certain latency \(\Delta\), that depends on the forward and backward recursion scheduling [39], on the trellis initialization strategy [40] and on the parallelism level of the SISO architecture [41]. Then, depending on SISO internal scheduling, the \(\lambda_{i,j}\) values are output in a certain order. In Fig. 3 we assume that \(\lambda_{i,j}\) values are output in a last-in-first-out order (from light gray to black) according to the backward recursion order. The number of clock cycles between two consecutive \(\lambda_{i,j}\) is \(\tau\). Similarly, the number of clock cycles between the last output \(\lambda\) of a window (line-filled cell) and the first \(\lambda\) of the successive window (light gray cell) is \(\theta\). Stemming from Fig. 3, the required parameters to describe the SISO architecture are:

1) the window size \((W)\) [42],
2) the SISO latency \((\Delta)\) expressed in clock cycles;
3) the Recursion Order \((RO)\) used to send extrinsic informations on the network, namely Forward Recursion Order \((FRO)\) or Backward Recursion Order \((BRO)\),
4) the number of clock cycles between two consecutive outputs \(\lambda\) within a window \((\tau)\),
5) the number of clock cycles between the last output \(\lambda\) of a window and the first \(\lambda\) of the successive window \((\theta)\).

The simulator acts in two phases, a static phase (instantiation and binding) and a dynamic phase (cycle accurate simulation). During the static phase, the topology description defines \(P, D\) and all possible paths from one node to the other. The simulator represents the topology as a graph and calculates all the local shortest paths repeating the Floyd-Warshall algorithm [43] on pruned versions of the graph until no more local paths exist between a source node \(i\) and its adjacent nodes; it then stores each result of the Floyd-Warshall algorithm as an array. If a SSP routing algorithm is selected, only the first shortest path array is employed, otherwise all the shortest paths are considered. Moreover, \(P\) nodes are instantiated and binned according to the assigned topology and each SISO memory is loaded with \(N/P\) messages, based on the assigned permutation. The actual decoding process executed by SISO elements is not included in the tool, which only simulates the exchange of extrinsic informations. However, the SISO architecture parameters are employed to initialize a set of counters that are used to send the extrinsic information over the network with the same timing as in the real SISO architecture. The node is described by means of a hardware-description-language-like \((hdl-like)\) model. When the static phase is completed, the simulation starts resorting to the SystemC node simulator and performs a cycle accurate hdl-like simulation.

The results provided by the Turbo NOC simulator can be divided in two categories: cycle by cycle results and global results. The cycle by cycle results are: i) for each node, the status of each FIFO, ii) for each node the FIFO read enable and the crossbar configuration signals, iii) for each SISO the
t'(i, j) sequence. The global results are: i) for each FIFO in each node the maximum FIFO size, ii) for each node the minimum, maximum and average latency (in clock cycles) of each received message and the total number of clock cycles to deliver all the messages. It is worth pointing out that, since self loops are not used with DCM, the maximum size of the corresponding input FIFOs is zero. In this case the tool automatically reduces the degree of those nodes with the effect of saving complexity.

VII. ROUTING ALGORITHM ARCHITECTURE

In order to keep the NOC complexity as small as possible, SSP-RR and SSP-FL routing algorithms have been implemented with FA and PP node architectures, whereas the ASP-FT algorithm has been implemented as a routing memory with the AP node architecture.

A. SSP-RR and SSP-FL architectures

The SSP-RR and SSP-FL architectures, thoroughly shown in Fig. 4 and 5, are made of two main parts. The first part sorts the input FIFOs based on the selected priority method (round-robin or FIFO length) and generates $M$ signals, $S_0, \ldots, S_{M-1}$, where $S_0$ is the label of the input that is served first and $S_{M-1}$ is the label of the input that is served last. The second part serves the input FIFOs according to the order specified by the $S_0, \ldots, S_{M-1}$ sequence and generates the read-enable ($\text{ren}_i$) signals for the FIFOs, the load enable ($\text{le}_i$) signals for the output registers and the configuration commands for the crossbar ($\text{adx}_i$), where $\text{adx}_i$ represents the label of the destination node specified by the first message in FIFO $i$. As shown in Fig. 4 (a), in the SSP-RR architecture a rotate register generates the $S_i$ signals. On the other hand, (see Fig. 4 (b)), in the SSP-FL architecture the $S_i$ signals are obtained with a sorting network. In node $n$ the sorting network takes as input the number of elements contained at time $j$ in each input FIFO ($L_{i,p}^n$ with $p \in \{0, \ldots, M-1\}$) and outputs $S_0 = \arg \{\max_p \{L_{i,p}^n\}\}, \ldots, S_{M-1} = \arg \{\min_p \{L_{i,p}^n\}\}$. Both SSP-RR and SSP-FL architectures have been designed as parametric units to support different $M$ values.

The generation of the $\text{ren}_i$, $\text{adx}_i$ and $\text{le}_i$ signals is enabled by the FIFO empty signals of the input FIFOs and requires the following units: a look-up-table (LUT), $M$ reservation blocks and a priority decoder (Fig. 4 (c)). The LUT contains the shortest-path information.

In the SSP approach for each node $i$ the $N^{i,k}$ set contains only one node, and $I_{i,k}^{j,i}$ contains only one port. There is only an output port on node $i$ that connects node $i$ with node $k$ on a shortest path. Thus, every LUT contains $P$ locations and the LUT in node $i$ at location $k$ contains the label of the output port to connect node $i$ to node $k$. As a consequence, each LUT is a $P \times \lceil \log_2(M) \rceil$ table that converts $M$ destinations ($\text{dest}_i$) to the corresponding ports ($\text{port}_i$) for each input FIFO ($\text{dport}_i$).

The reservation blocks update an $M$-position binary mask to avoid collisions on output ports, whereas the priority decoder implements the selected priority and FIFO management policies by properly generating the the $\text{ren}_i$, $\text{adx}_i$, and $\text{le}_i$ signals. Since the $\text{le}_i$ and $\text{adx}_i$ signals must be asserted the clock cycle after the $\text{ren}_i$ and $\text{adx}_i$, they are delayed by means of registers. In particular, the $\text{le}_i$ and $\text{adx}_i$ are obtained by delaying the $\text{ren}_i$ and $\text{adx}_i$ of one clock cycle.

1) Reservation block: Each reservation block (Fig. 5 (a)) receives the $\text{dport}_i$ signals, according to the $S_0, \ldots, S_{M-1}$ sequence, generates a reservation signal ($\text{reserve}_i$) and specifies the output port to be reserved ($\text{port}_i$). The reservation is obtained by updating the $\text{rmask}_i$, which contains a ‘1’
in the position of a reserved output port and a ‘0’ in the position of a free output port. Each reservation block generates $\text{port}_i = \text{dport}_{S_i}$, that is converted by a one-hot decoder into a mask with a ‘1’ in position $\text{port}_i$. The reservation mask is updated (output rmask) by comparing this mask with the input rmask: if the input rmask contains a ‘0’ in position $\text{port}_i$, the reserve goes to ‘1’.

2) Priority decoder: The priority decoder is made of two blocks: the read-enable generation block (Fig. 5 (b)) and the destination-port generation block (Fig. 5 (c)).

a) read-enable generation block: The read-enable generation block is based on few logic gates that act differently depending on the approach selected to manage the input FIFOs (SCM/DCM): i) in the DCM approach, $\text{ren}_i = \text{reserve}_i$, when FIFO $i$ is not empty (FIFO empty$_i$ = ‘0’) is obtained by combining $S_i$ one-hot representation with the corresponding reserve signal. ii) in the SCM approach, $\text{ren}_i = \text{en}_i$, when FIFO $i$ is not empty (FIFO empty$_i$ = ‘0’) is based on $\text{en}_i$ that is a set of $M$ signals produced by the destination-port generation block, where $\text{en}_i = '0'$ when $l\text{ren}_i = '0'$ and $\text{ladx}_i = M - 1$, namely the output port with label $M - 1$ is used only for messages whose destination is the node itself (Fig. 5 (c)).

b) destination-port generation block: This is an array of multiplexers, where each multiplexer in position $i, i$ implements $\text{ladx}_i = \text{port}_i$, when $l\text{ren}_i = '1'$. On the other hand, $\text{ladx}_i$ must take the value of an un-reserved output port when $l\text{ren}_i = '0'$. This is obtained by means of the permutation network implemented by the multiplexers in position $j, i$ with $j \neq i$ whose outputs ($\text{mux}_{j,i}$) are

$$\text{mux}_{j,0} = \begin{cases} 0 & \text{if } \text{port}_0 = j \\ j & \text{otherwise} \end{cases} \quad (8)$$

and for $i > 0$

$$\text{mux}_{j,i} = \begin{cases} \text{mux}_{i,i-1} & \text{if } \text{port}_i = j \\ \text{mux}_{j,k} & \text{otherwise} \end{cases} \quad (9)$$

where

$$k = \begin{cases} i - 2 & \text{if } j = i - 1 \\ i - 1 & \text{otherwise} \end{cases} \quad (10)$$

and if $k < 0$, then $\text{mux}_{j,k} = 0$.

B. ASP-FT architecture

The ASP-FT algorithm is simply implemented by means of a routing memory. As a consequence, DCM and SCM approaches are integrated by filling the routing memory with the appropriate configuration words. Each word is the concatenation of the $\text{ren}_0, \ldots, \text{ren}_{M-1}$ signals with the $\text{adx}_0, \ldots, \text{adx}_{M-1}$ signals. In order to reduce the word width, the $\text{adx}_0, \ldots, \text{adx}_{M-1}$ signals, which can be represented on $M \times \lceil \log_2(M) \rceil$ bits, are coded into a crossbar configuration word (ccw). Since for an $M$-port crossbar the possible configurations are $M!$, ccw is represented on $\lceil \log_2(M!) \rceil$ bits. The corresponding decoder is hardwired into the crossbar. Thus, as shown in Fig. 6 the main component in the routing memory architecture is a RAM. The RAM address (radx) is generated by an adder and a register. The adder is incremented when at least one input FIFO is not empty (FIFO empty$_i$ = ‘0’) and it is initialized to zero when the half iteration starts (init). Moreover, the $\text{ren}_i$ are forced to ‘0’ when FIFOs are empty.

C. Architecture implementation

To achieve high throughput the routing algorithm should be able to serve the input FIFOs in one clock cycle. This requirement, which is an intrinsic feature of the routing memory architecture used for the ASP-FT, implies that the architectures for the SSP-RR and SSP-FL routing algorithms are combinational circuits. As it can be inferred from Fig. 4 and 5 the speed of SSP-RR and SSP-FL architectures depends mainly on $M$, in fact, $M$ impacts on the size of several parts of the routing algorithm architectures, namely the sorting network, the shortest-path information LUT, the reservation mask, the priority decoder and on the number of reservation blocks. Given the topologies presented in section IV, we described the SSP-RR and SSP-FL architectures as parametric blocks and we performed the logical synthesis on a 130 nm standard cell technology for $M \in \{3, 4, 5\}$. Post synthesis results confirm that a clock frequency of more than 200 MHz is achieved with a complexity that ranges from about 1000 $\mu$m$^2$ to about 6000 $\mu$m$^2$.

VIII. SIMULATIONS AND RESULTS

The Turbo NOC simulator has been used to simulate both interleaving and de-interleaving with four significant permutation laws, namely:

1) WiMax interleaver with $N=2400$ and $W=38$
2) UMTS interleaver with $N=5114$ and $W=40$
3) A prunable $S$-random interleaver [44] with $N=16384$ and $W=37$
4) A circular shifting interleaver [45] with $N=24576$ and $W=39$

We tested the following topologies:

1) ring (R)
2) toroidal mesh (T)
3) honeycomb (H)
4) generalized de-Bruijn (B)
5) generalized Kautz (K)
for \( P \in \{8, 16, 32, 64\} \), with SSP-RR, SSP-FL and ASP-FT routing algorithms and including DCM and SCM approaches for FIFO management. The SISO architecture parameters were set as follows: \( \Delta = W/R, \theta = \tau = \rho = 1 \) and backward.
recursion order (RO=BRO). The choice of the window size has been obtained as follows. Usually, in optimized turbo decoder architectures $W$ is chosen such that $N/(P \cdot W) \in \mathbb{N}$ to maximize the decoder throughput and simplify SISOS synchronization [24]. In some cases this condition can be easily satisfied (e.g. $N=2400$, $P=8$ and $W=30$), whereas in other cases it is not possible (e.g. $N=5114$, $P=8$ leads to $N/P \notin \mathbb{N}$). Given this situation we decided to fix a value of $W$ for each $N$ so that $N/(P \cdot W) \notin \mathbb{N}$. This choice avoids “lucky cases” with $N/(P \cdot W) \in \mathbb{N}$ and makes the comparison fair. Moreover, we tried to choose the $W$ values as close as possible one to the other to avoid results that are too influenced by the $W$ value itself. For each case, the Turbo NOC simulator provided the total number of cycles required to perform a complete iteration (interleaving and de-interleaving), the depth of each FIFO in the network, the content of each routing memory (see Fig. 1 (b)) and the $t'(i,j)$ sequence to be stored into the location memory (see Fig. 1 (b) and (c)). As a consequence, for each case we can estimate the achieved throughput for a certain clock frequency with a given number of iterations. Moreover, to characterize the complexity of each solution we give the synthesis results of all simulated networks for a 130 nm standard cell technology. Memories have been generated by means of a 130 nm memory generator. The area results concern all the nodes in the network where each node includes the blocks depicted in Fig. 1 except the SISO and the memory used to store the extrinsic information (shaded gray blocks in Fig. 1). The complexity characterization shown in this work does not consider post place and route area overhead. The actual routing overhead is expected to be quite limited for regular topologies, such as meshes and toroids, which can be implemented by means of a modular layout. For example in [46] and [47] it is shown that for regular topologies, at least at the 130 nm technology node, the area occupation of logic cells in the design gives a useful indication about the actual complexity of the NOC. Routing overhead in less regular topologies is somewhat more difficult to estimate before place and route. However, results are available in the literature showing that a low area layout can be generated for generalized de-Bruijn networks. For example in [48] and [49] an optimal layout algorithm is run on generalized de-Bruijn topologies leading to a layout comparable and in some cases smaller than the one required by toroidal meshes. More in general, since the layout area of a network $x$ can be expressed as $\Omega(|B(x)|_{\min}^2)$ [50], where $|B(x)|_{\min}$ is the minimum bisection width of the network, this model can be used to roughly compare different topologies. For example, $|B(T)|_{\min} = 2\sqrt{P}$ and $|B(B)|_{\min} = \Omega(|P|/\log P)$ for toroidal mesh and for de-Bruijn topologies respectively: this shows that the routing overhead of de-Bruijn network is similar to that of toroidal meshes, at least up to 64 nodes.

As a significant case of study we consider each extrinsic information value represented on 8 bits. Thus, we represented $\lambda$ on 8 bits for all the simulations, except the ones related to the WiMax permutation law. In fact, since the WiMax turbo code is double binary, its extrinsic information is an array made of three log-likelihood ratios, as a consequence a message is represented on 24 bits. Moreover, we consider $f_{\text{clk}} = 200$ MHz and $I = 8$; thus, from (3) we can infer that to sustain a target throughput of $T = 200$ Mb/s, we need at least $d \cdot P \cdot R = 16$, namely at least $P \cdot R = 16$ for binary codes and at least $P \cdot R = 8$ for double binary codes. However, due to the $IL$ term in (2), higher values of the $P \cdot R$ product are also of interest.

The analysis of the experimental results obtained with the Turbo NOC simulator shows some interesting general properties.

1. SSP solutions adopting the FA node architecture are the most demanding implementations in terms of area. Since the PP node architecture achieves the same throughput as the FA solution with a lower area, in the following only the PP node architecture will be addressed.

2. The DCM FIFO management method performs better than the SCM one both in terms of throughput and complexity. As a consequence, in the following only results that are referred to the DCM approach will be presented.

3. Generalized de-Bruijn and generalized Kautz topologies achieve nearly the same results both in terms of throughput and complexity. In the following only results obtained with generalized Kautz topologies will be presented.

4. Results tend to be clustered into two families, namely short interleavers (WiMax interleaver with $N=2400$ and UMTS interleaver with $N=5114$) and long interleavers (prunable S-random interleaver with $N=16384$ and circular shifting interleaver with $N=24576$). For the sake of clarity, in the following, only results obtained for the WiMax interleaver ($N=2400$) and circular shifting interleaver ($N=24576$) will be presented.

The most significant experimental results are summarized in Table II and III that refer to the WiMax interleaver with $N = 2400$ and to the circular shifting interleaver with $N = 24576$ respectively. Each cell of the two tables gives the throughput in Mb/s and the area in $\text{mm}^2$ obtained for different $P$ and $R$ values, routing algorithms and architectures for the DCM approach. In Table III light-gray, mid-gray and dark-gray cells indicate the highest throughput, the highest area and the lowest area points for each $D$ value respectively.

The most important conclusions that can be derived from results in Table II and III are:

1. In most cases, topologies with $D=4$ achieve higher throughput with lower complexity overhead than topologies with $D=2$ when $R \rightarrow 1$. On the other hand, when $R < 1$ increasing $D$ from 2 to 3 always leads to a higher throughput and a smaller area, whereas increasing $D$ from 3 to 4 leads to higher throughput but also to higher area.

2. The ASP-FT routing algorithm is the best performing solution both in terms of throughput and area when $R = 1$.

3. The routing memory overhead of the ASP-FT algorithm (see Fig. 1 (b)) becomes relevant as $R$ decreases and SSP solutions become the best solutions mainly for $P = 8$ and $P = 16$. 

Figure 7. Throughput/area comparison of different topologies for the case $R = 1$, ASP-FT routing algorithm, DCM approach

Figure 8. Throughput/area comparison of different topologies and routing algorithm with DCM approach: WiMax interleaver, $N = 2400$ (a) and $R = 0.33$, $P = 16$ (b); circular shifting interleaver, $N = 24576$ for $R = 1$, $P = 64$ (c) and $R = 0.33$, $P = 16$ (d)
Table V
HARDWARE RESOURCES BREAKDOWN FOR THE CIRCULAR SHIFTING INTERLEIVER WITH N=24576, DCM APPROACH; SOME SIGNIFICANT POINTS

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>R</td>
<td>64</td>
<td>ASP-FT (AP)</td>
<td>11.45 (95.15%)</td>
<td>0.03 (0.15%)</td>
<td>0.08 (0.41%)</td>
<td>6.35 (32.80%)</td>
<td>1.45 (74.49%)</td>
<td>19.36 (100%)</td>
</tr>
<tr>
<td>2</td>
<td>R</td>
<td>64</td>
<td>SSP-RR (PP)</td>
<td>27.46 (90.43%)</td>
<td>0.05 (0.16%)</td>
<td>0.14 (0.46%)</td>
<td>0.11 (0.36%)</td>
<td>0.26 (8.59%)</td>
<td>30.37 (100%)</td>
</tr>
<tr>
<td>2</td>
<td>R</td>
<td>8</td>
<td>SSP-FL/RR (PP)</td>
<td>0.05 (2.84%)</td>
<td>0.01 (0.57%)</td>
<td>0.01 (0.57%)</td>
<td>0.01 (0.57%)</td>
<td>0.16 (95.45%)</td>
<td>1.76 (100%)</td>
</tr>
<tr>
<td>2</td>
<td>K</td>
<td>64</td>
<td>ASP-FT (AP)</td>
<td>6.53 (60.46%)</td>
<td>0.03 (0.28%)</td>
<td>0.08 (0.74%)</td>
<td>2.71 (25.09%)</td>
<td>1.45 (13.43%)</td>
<td>10.80 (100%)</td>
</tr>
<tr>
<td>2</td>
<td>K</td>
<td>64</td>
<td>SSP-RR (PP)</td>
<td>23.81 (89.11%)</td>
<td>0.05 (0.19%)</td>
<td>0.14 (0.52%)</td>
<td>0.11 (0.41%)</td>
<td>0.26 (9.77%)</td>
<td>26.72 (100%)</td>
</tr>
<tr>
<td>2</td>
<td>K</td>
<td>8</td>
<td>SSP-FL (PP)</td>
<td>0.05 (2.86%)</td>
<td>0.10 (5.75%)</td>
<td>0.11 (5.75%)</td>
<td>0.16 (96.00%)</td>
<td>1.75 (100%)</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>H</td>
<td>64</td>
<td>ASP-FT (AP)</td>
<td>9.25 (68.77%)</td>
<td>0.09 (0.67%)</td>
<td>0.11 (0.82%)</td>
<td>2.55 (18.96%)</td>
<td>1.45 (10.78%)</td>
<td>13.45 (100%)</td>
</tr>
<tr>
<td>3</td>
<td>H</td>
<td>64</td>
<td>SSP-RR (PP)</td>
<td>18.02 (85.24%)</td>
<td>0.10 (0.47%)</td>
<td>0.18 (0.85%)</td>
<td>0.23 (1.09%)</td>
<td>0.26 (12.55%)</td>
<td>21.14 (100%)</td>
</tr>
<tr>
<td>3</td>
<td>H</td>
<td>8</td>
<td>SSP-RR (PP)</td>
<td>0.05 (2.84%)</td>
<td>0.01 (0.57%)</td>
<td>0.01 (0.57%)</td>
<td>0.16 (95.45%)</td>
<td>1.76 (100%)</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>K</td>
<td>64</td>
<td>ASP-FT (AP)</td>
<td>7.86 (67.41%)</td>
<td>0.09 (0.77%)</td>
<td>0.11 (0.94%)</td>
<td>2.15 (18.44%)</td>
<td>1.45 (12.44%)</td>
<td>11.66 (100%)</td>
</tr>
<tr>
<td>3</td>
<td>K</td>
<td>64</td>
<td>SSP-FL (PP)</td>
<td>13.15 (80.33%)</td>
<td>0.10 (0.61%)</td>
<td>0.18 (1.01%)</td>
<td>0.33 (2.02%)</td>
<td>0.26 (15.94%)</td>
<td>16.37 (100%)</td>
</tr>
<tr>
<td>3</td>
<td>K</td>
<td>8</td>
<td>SSP-FL (PP)</td>
<td>0.06 (3.35%)</td>
<td>0.01 (0.56%)</td>
<td>0.02 (1.12%)</td>
<td>0.02 (1.12%)</td>
<td>1.68 (93.85%)</td>
<td>1.79 (100%)</td>
</tr>
<tr>
<td>4</td>
<td>T</td>
<td>64</td>
<td>ASP-FT (AP)</td>
<td>5.14 (55.94%)</td>
<td>0.23 (2.5%)</td>
<td>0.13 (1.41%)</td>
<td>2.24 (24.37%)</td>
<td>1.45 (15.78%)</td>
<td>9.19 (100%)</td>
</tr>
<tr>
<td>4</td>
<td>T</td>
<td>64</td>
<td>SSP-RR (PP)</td>
<td>13.21 (79.63%)</td>
<td>0.17 (1.02%)</td>
<td>0.23 (1.39%)</td>
<td>0.37 (2.23%)</td>
<td>0.26 (15.73%)</td>
<td>16.59 (100%)</td>
</tr>
<tr>
<td>4</td>
<td>T</td>
<td>8</td>
<td>SSP-FL (PP)</td>
<td>0.06 (3.35%)</td>
<td>0.01 (0.56%)</td>
<td>0.02 (1.12%)</td>
<td>0.02 (1.12%)</td>
<td>1.67 (93.85%)</td>
<td>1.78 (100%)</td>
</tr>
<tr>
<td>4</td>
<td>K</td>
<td>64</td>
<td>ASP-FT (AP)</td>
<td>3.78 (49.67%)</td>
<td>0.22 (2.89%)</td>
<td>0.13 (1.71%)</td>
<td>2.03 (26.68%)</td>
<td>1.45 (19.05%)</td>
<td>7.61 (100%)</td>
</tr>
<tr>
<td>4</td>
<td>K</td>
<td>64</td>
<td>SSP-RR (PP)</td>
<td>6.86 (67.12%)</td>
<td>0.16 (1.57%)</td>
<td>0.23 (2.25%)</td>
<td>0.36 (3.52%)</td>
<td>2.61 (25.54%)</td>
<td>10.22 (100%)</td>
</tr>
<tr>
<td>4</td>
<td>K</td>
<td>8</td>
<td>SSP-FL (PP)</td>
<td>0.06 (3.35%)</td>
<td>0.10 (0.56%)</td>
<td>0.02 (1.11%)</td>
<td>0.03 (1.67%)</td>
<td>1.68 (93.33%)</td>
<td>1.80 (100%)</td>
</tr>
</tbody>
</table>

(1) The area and the percentage are not really zero, but they are negligible compared with the IM and LM contribution to the total area.

4) In most cases, generalized Kautz topologies are the best performing topologies.

As a significant example, in Fig. 7, we show the experimental results obtained with $R = 1$ and ASP-FT routing algorithm for the WiMax interleaver with $N = 2400$ (a) and the circular shifting interleaver with $N = 24576$ (b). Each point represents the throughput and the area obtained for a certain topology with a certain parallelism degree $P$. Results referred to the same $P$ value are bounded into the same box and a label is assigned to each point to highlight the corresponding topology, namely topologies are identified as R-ring, H-honeycomb, T-toroidal mesh, K-generalized Kautz with the corresponding $D$ value ($K_2$, $K_3$, $K_4$).

As it can be observed, generalized Kautz topologies with $D = 4$ ($K_4$) are always the best solutions to achieve high throughput with minimum area overhead. Moreover, we observe that increasing $D$ has the effect of reducing network congestion, which has two positive consequences: higher throughput and smaller size of input FIFOs. However, the saved area benefit is partially mitigated by the increased complexity of the routing algorithm, as highlighted in the first part of Table V for PP node architectures: for example, K topology requires 23.81 mm$^2$ for FIFOs and 0.11 mm$^2$ for routing logic/memory when $D = 2$, whereas areas of 6.86 mm$^2$ and 0.36 mm$^2$ are occupied when $D = 4$. On the contrary, if an AP architecture is used for switching nodes, the routing memory size progressively becomes smaller with increasing $D$, due to the higher number of packets that can be switched at each clock cycle. For example, in the second part of Table V (AP node architecture), the area occupied by the routing memory (RA/M) reduces with increasing $D$.

In Fig. 8 significant results extracted from Table II and III are shown in graphical form. In particular, for $R = 1$ the ASP-FT routing algorithm is the best solution, whereas for $R < 1$ SSP routing algorithms, implemented as PP node architectures, tend to achieve the same performance as the ASP-FT routing algorithm with lower complexity overhead (see Fig. 8 (a) and (b) for the WiMax interleaver, $N = 2400$ and Fig. 8 (c) and (d) for the circular shifting interleaver, $N = 24576$).

An interesting phenomenon that arises increasing the interleaver size is the performance saturation that can be observed in the Table III for $D = 2$ topologies, namely the throughput tends to saturate and increasing $R$ has the effect of augmenting the area with a negligible increase or even with a decrease of throughput. As an example, the generalized Kautz topology with $P = 64$ and ASP-FT routing algorithm achieves more than 180 Mb/s with $R = 1$, $R = 0.5$, $R = 0.33$. However, the solution with the smallest area is the one obtained with $R = 0.33$.

The throughput flattening of low $D$ topologies can be explained by observing that high values of $R$ tend to saturate the network. Furthermore, high values of $R$ lengthen the input FIFOs as highlighted in Table IV, where the total area of the network is given as the breakdown of the building blocks, namely the input FIFOs, the crossbars (CB), the output registers, the routing algorithm/memory (RA/M), the identifier memory (IM) and the location memory (LM) is given for some significant cases: the highest throughput (light-gray), the highest area (mid-gray), and lowest area (dark-gray) points for
each $D$ value in Table III.

IX. CONCLUSIONS

In this work a general framework to design network on chip based turbo decoder architectures has been presented. The proposed framework can be adapted to explore different topologies, degrees of parallelism, message injection rates and routing algorithms. Experimental results show that generalized de-Bruijn and generalized Kautz topologies achieve nearly the same results both in terms of throughput and complexity. Moreover, as highlighted in Table II and III generalized Kautz topologies achieve high throughput with a limited complexity overhead. In particular, in most cases, topologies with $D=4$ achieve higher throughput with lower complexity overhead than topologies with $D=2$ when $R \rightarrow 1$. On the other hand, when $R < 1$ increasing $D$ from 2 to 3 always leads to a higher throughput and a smaller area, whereas increasing $D$ from 3 to 4 leads to higher throughput but also to higher area. Finally, depending on the target throughput requirements different parallelism degrees, message injection rates and routing algorithms can be used to minimize the network area overhead.

ACKNOWLEDGMENT

This work is partially supported by the Italian National Program FIRB, contract n. RBAP06L455 and by the Newcom++ network of excellence, funded by the European Community.

REFERENCES


Maurizio Martina was born in Pinerolo, Italy, in 1975. He received the M.Sc. and Ph.D. in electrical engineering from Politecnico di Torino, Italy, in 2000 and 2004 respectively. He is currently a Post-doctoral Researcher at the VLSI Lab, Politecnico di Torino. His research activities include VLSI design and implementation of architectures for digital signal processing and communications.

Guido Masera received the Dr.Eng. degree (summa cum laude) in 1986, and the Ph.D. degree in electrical engineering from Politecnico di Torino, Italy, in 1992. Since 1986 to 1988 he was with CSELT (Centro Studi e Laboratori in Telecomunicazioni, Torino, Italy) as a researcher involved in the standardization activities for the GSM system. Since 1992 he has been Assistant Professor and then Associate Professor at the Electronic Department, where he is a member of the VLSI-Lab group. His research interests include several aspects in the design of digital integrated circuits and systems, with special emphasis on high-performance architecture development (especially for wireless communications and multimedia applications) and on-chip interconnect modeling and optimization. He has coauthored more than 160 journal and conference papers in the areas of ASIC-SoC development, architectural synthesis, VLSI circuit modeling and optimization. In the frame of National and European research projects, he has been co-designer of several ASIC and FPGA implementations in the fields of Artificial Intelligence, Computer Networks, Digital Signal Processing, Transmission and Coding.