Modern iterative channel code decoder architectures have tight constraints on throughput but require flexibility to support different modes and standards. Unfortunately, flexibility often comes at the expense of increasing the number of clock cycles required to complete the decoding of a data frame, thus reducing the sustained throughput. The Network-on-Chip (NoC) paradigm is an interesting option to achieve flexibility, but several design choices, including the topology and the routing algorithm, can affect the decoder throughput. In this work logarithmic-diameter topologies, in particular generalized de-Bruijn and Kautz topologies, are addressed as possible solutions to achieve both flexible and high-throughput architectures for iterative channel code decoding. In particular, this work shows that the optimal shortest-path routing algorithm for these topologies, which is already available in the open literature, can be efficiently implemented with a very simple circuit. Experimental results show that the proposed architecture features a reduction of about 14% and 10% for area and power consumption respectively, with respect to a previous shortest-path routing-table-based design.
Introduction
Flexible and scalable interconnection systems have become of paramount importance in modern System-on-Chip (SoC) designs. In this context, the Network-on-Chip (NoC) approach [1, 2, 3, 4] has been proposed as an interesting solution. Considerable research effort has been spent in recent years not only to improve NoC reliability and efficiency, but also to reduce complexity and save power by means of several different techniques, such as [5, 6, 7], including wireless NoCs [8] that exploit Ultra-Wide-Band technology [9]. Several design parameters, such as the NoC topology, the Routing Algorithm (RA) and the scheduling policy [10, 11, 12], affect the complexity and the power consumption of the NoC. Most of the works published in the literature (e.g. [13, 14]) highlight that 2D mesh and 2D mesh-like topologies are among the best topologies for tile-based ASIC implementation. However, some recent publications have shown that the general NoC approach cannot be straightforwardly applied to high throughput and low latency applications. Indeed, in [15] application-specific NoCs are proposed, i.e. the NoC used to interconnect different IPs is tailored around the application. This concept is further developed in [16], where intra-IP NoCs are described. The intra-IP NoC is proposed as a very low complexity interconnection structure to build flexible IPs, whose computation relies on a parallel architecture made of several processing elements (PEs). Examples of high throughput and low latency architectures that can take advantage of intra-IP NoCs are mainly in the field of baseband processing for telecommunications, such as baseband multiple-input-multiple-output systems [17] and iterative channel code decoders [18]. Both applications require flexibility to support different coding modes and standards. In particular, in [18] both turbo and Low-Density-Parity-Check (LDPC) codes from several different standards are supported.
It is worth noting that, since the throughput of iterative channel code decoders is inversely proportional to the latency of the architecture, reducing the latency of the NoC is a viable solution to increase the throughput. In these cases previous works [19, 20] show by simulation that minimizing the number of clock cycles spent to send a message from the source node to the destination node is of paramount importance, and that the maximum distance between two nodes, i.e. the diameter of the topology, should be minimized to reduce the delivery time. Recently, logarithmic-diameter topologies, such as de-Bruijn [21] and Kautz [22] topologies, have gained popularity in the research community [20, 23, 24, 25, 26, 27] as viable alternatives to the well-known mesh and mesh-like solutions. Indeed, logarithmic-diameter topologies like de-Bruijn and Kautz guarantee short minimum-distance paths while showing a regular structure, similar to that of the simpler mesh. Furthermore, [28] shows an algorithm to build optimal VLSI layouts for de-Bruijn topologies. Another important characteristic of de-Bruijn and Kautz topologies is the self-routing property [29], which is exploited in [30] to develop a shortest-path RA to connect any pair of nodes. Even if [30] solved the problem of building an optimal shortest-path RA for these topologies, low complexity hardware implementations are still of interest in the context of NoCs [23, 24, 25, 26, 27]. A well-known flexible solution to support most topologies and RAs employs routing tables. However, it has been shown that this approach does not scale in terms of latency and area. This problem has been addressed in [31] and further developed in [14, 32], where Logic-Based-Distributed-Routing (LBDR) for mesh-derived topologies is proposed. For a survey on LBDR the reader can refer to [33].
Inspired by the LBDR approach, this work proposes to exploit the optimal shortest-path RA described in [30] for de-Bruijn and Kautz topologies to implement a distributed RA that features lower complexity than the routing-table-based architecture. The proposed solution is applied to the flexible architecture for iterative channel code decoding proposed in [18], extending the results presented in [34]. Even if optimized interconnection structures can be custom designed for one channel code decoder, or for a limited number of cases, this approach is not practical when high flexibility is required, e.g. when the decoder supports several codes and communication standards [18, 35, 36]. Thus, the contributions of this work are i) to show that the optimal shortest-path RA described in [30] can be implemented as a distributed RA, obtaining a very low complexity circuit, and ii) to employ this circuit in the design of flexible iterative channel code decoder architectures, achieving better results than the ones shown in the state of the art [18].
The rest of the paper is organized as follows. Section 2 summarizes related work and Section 3 presents generalized de-Bruijn and Kautz topologies, highlighting the main characteristics that can be exploited in NoC design. In Section 4 the optimal shortest-path RA proposed in [30] is summarized and a very simple circuit for implementing it is proposed. This circuit is then exploited in Section 5 to reduce the complexity of the intra-IP-NoC-based architecture proposed in [18] for turbo and LDPC code decoding. Finally, in Section 6 conclusions are drawn.
Related work
In [23] the authors show that de-Bruijn and Kautz topologies achieve higher performance and consume less power than meshes in general NoC-based systems.
A similar result is shown in [24], where the routing is based on the optimal shortest-path RA described in [30]. However, [24] targets virtual channels with a general packet structure and stores tag bits in the header flit. The meaning of these tag bits will be clarified in Section 4, where we will show that tag bits can be removed from the packet. The work in [25] proposes two RAs based on shift direction. The most complex one, referred to as shortest shifting based RA, relies on shortest-path routing. However, since [25] does not target a specific application, the packet structure is more complex than the one used in the present work. In [27] a simple RA for de-Bruijn topologies, based on shift and compare operations as in [25], is proposed. However, it is applied only to binary de-Bruijn topologies and no area results are given. In [26] a Kautz NoC is used to design a very high speed neocortical computing processor for visual recognition. The properties of the Kautz topology, in particular its logarithmic diameter, are exploited to enable fast communication among the processing cores and to perform distributed low-radix routing.
On the other hand, the works presented in [18, 19, 20] propose de-Bruijn and Kautz topologies for iterative channel code decoder architectures. In particular, results in [19, 20] confirm that, for iterative channel code decoder architectures, the throughput achieved by de-Bruijn and Kautz topologies is from 10% to 50% higher than the one obtained with ring, honeycomb and mesh topologies with nearly the same area. These results are exploited in [18] to design a NoC-based decoder architecture that supports both turbo and LDPC codes from several different standards. All the works in [18, 19, 20] exploit shortest-path routing to minimize the latency and maximize the throughput; however, they do not use the optimal RA described in [30]. Moreover, [18, 19] rely on routing tables, whereas [20] is limited to binary de-Bruijn topologies and does not provide details on the circuit to implement the shortest-path RA.
Finally, we expect that other applications could benefit from the use of de-Bruijn and Kautz topologies. As an example, NoC-based architectures for video processing have been considered in [37, 38]. Moreover, decoders for the H.264 standard for video compression are included in the MSCL NoC benchmark suite [39]. In particular, H.264 decoders have a number of tasks and links that is comparable with the turbo/LDPC decoder case study. It is worth pointing out that, as argued in [24], de-Bruijn and Kautz topologies are scalable, meaning that the general architecture to implement the node (and the RA) remains the same independently of the number of nodes.
De-Bruijn and Kautz topologies
In this section de-Bruijn and Kautz topologies are presented. The notation used to describe the topologies and to introduce their characteristics is summarized in Table 1. De-Bruijn and Kautz topologies are obtained by building directed graphs according to the following definitions.
Definition 1. A de-Bruijn sequence is an array of q elements, where each element is taken from an alphabet A with l symbols.
Thus, a de-Bruijn graph is made of nodes labeled with de-Bruijn sequences, and there is an arc from node v to node w if w_i = v_{i−1} for 1 ≤ i ≤ q − 1, that is w is obtained by left-shifting v and by placing in the rightmost position a symbol from A. As a consequence, each node is connected to l nodes. Thus, the graph is regular with degree D = l and the number of nodes is P = l^q (Fig. 1 (a)). Unfortunately, de-Bruijn graphs have self-loops (one node connected to itself). Self-loops can be avoided using Kautz graphs.
Definition 2. A Kautz sequence is a de-Bruijn sequence in which adjacent elements are different.
From this definition we infer that a Kautz graph is made of nodes labeled with Kautz sequences and there is an arc from node v to node w if w_i = v_{i−1} for 1 ≤ i ≤ q − 1 (with w_1 ≠ w_0). Equivalently, there is an arc from node v to node w if w is obtained by left-shifting v and by placing in the rightmost position a symbol from A, subject to the constraint that the result is a Kautz sequence. As a consequence, each node is connected to l − 1 nodes and the graph is regular with degree D = l − 1; the number of nodes is P = l · (l − 1)^{q−1} (Fig. 1 (b)).
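The left-shift construction can be checked on small cases with a short script. The following is only an illustrative sketch: the function names are hypothetical, symbols are represented as the integers 0, ..., l − 1, labels are tuples written left to right, and a Kautz sequence is assumed to be a sequence whose adjacent symbols differ.

```python
from itertools import product


def debruijn_nodes(l, q):
    """All length-q sequences over the alphabet {0, ..., l-1}."""
    return list(product(range(l), repeat=q))


def kautz_nodes(l, q):
    """Length-q sequences whose adjacent symbols differ (assumed Kautz rule)."""
    return [s for s in product(range(l), repeat=q)
            if all(a != b for a, b in zip(s, s[1:]))]


def successors(node, l, kautz=False):
    """Left-shift the label and append a new rightmost symbol.

    The tuple is written left to right, so node[-1] is the rightmost symbol.
    """
    out = []
    for sym in range(l):
        if kautz and sym == node[-1]:
            continue  # would put two equal symbols next to each other
        out.append(node[1:] + (sym,))
    return out


if __name__ == "__main__":
    l, q = 3, 2
    db = debruijn_nodes(l, q)
    kz = kautz_nodes(l, q)
    # Node counts: l**q for de-Bruijn, l*(l-1)**(q-1) for Kautz.
    print(len(db), len(kz))
    # de-Bruijn nodes that are their own successor (self-loops).
    print([v for v in db if v in successors(v, l)])
```

With l = 3 and q = 2 the script lists the constant sequences as the only self-loops of the de-Bruijn graph, which is exactly what the Kautz constraint rules out.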
As can be inferred from the above definitions, de-Bruijn and Kautz graphs do not exist for every choice of P and D. This limitation is overcome by generalized de-Bruijn and generalized Kautz graphs, proposed in [40, 41, 42].
Definition 3. A generalized de-Bruijn graph has an arc from node v to node w if (1) holds true:
w = (D · v + r) mod P,    (1)
with 0 ≤ v ≤ P − 1 and 0 ≤ r ≤ D − 1 (Fig. 1 (c)).
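Equation (1) translates directly into a neighbour function, and a breadth-first search from every node gives the diameter for any P and D. The sketch below is illustrative only; the function names are hypothetical and the search is a brute-force check, not part of the proposed circuit.

```python
from collections import deque


def gdb_successors(v, D, P):
    """Out-neighbours of node v according to w = (D*v + r) mod P."""
    return [(D * v + r) % P for r in range(D)]


def ceil_log(D, P):
    """Smallest m with D**m >= P, computed with integers only."""
    m, reach = 0, 1
    while reach < P:
        reach *= D
        m += 1
    return m


def diameter(D, P):
    """Longest shortest path, found by a breadth-first search from every node."""
    worst = 0
    for src in range(P):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in gdb_successors(u, D, P):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) != P:
            raise ValueError("graph is not strongly connected")
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    for D, P in [(2, 6), (3, 22), (4, 32)]:
        print(D, P, diameter(D, P), ceil_log(D, P))
```

For each pair (D, P) the script prints the measured diameter next to the smallest m such that D^m ≥ P, so the logarithmic growth can be observed on values of P that are not powers of D.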
Definition 4. A generalized Kautz graph has an arc from node v to node w if (2) holds true:
with 0 ≤ v ≤ P − 1 and 0 ≤ r ≤ D − 1 ( Fig. 1 (d) ). 
that is D + 1 is a divisor of P [29]. Moreover, as detailed in [41, 42], generalized de-Bruijn and generalized Kautz graphs have logarithmic diameter. This property is particularly interesting in the NoC context as it ensures that the length of the shortest paths connecting any two nodes (the number of hops) grows as the logarithm of P. In particular, the diameter of generalized de-Bruijn and 
Optimal shortest path routing: algorithm and implementation
In this section the optimal shortest-path RA, proposed in [30], is briefly summarized. Then, a simple circuit to implement it as a distributed RA is proposed.
Optimal shortest path routing algorithm for generalized Kautz topologies
One of the most interesting results shown in [29] is that generalized de-Bruijn and generalized Kautz graphs have the self-routing property. Moreover, there exists a path of length m = ⌈log_D(P)⌉ that connects any pair of nodes in the network. Besides, in [30] these properties are exploited to propose a shortest-path RA to connect any pair of nodes. For each pair of nodes v and w the RA computes a tag t that is exploited to derive the shortest path. Namely, the tag is converted into an array of D-ary elements, whose length z ≤ m is the length of the path, and the routing path is derived as follows:
where n = 1, . . . , z, y_z = v and y_0 = w.
Algorithm 1
 7:     if z is odd then
 8:
 9:     else
10:
11:     end if
12:     if g < D^z then
13:       found ← TRUE
14:       for n = 1 to z do
15:         if n is odd then
16:
17:         else
18:           t_n ← g_n
19:         end if
20:       end for
21:     else
22:
23:     end if
24:   end while
25: end if
The steps to compute t are shown in Algorithm 1, where v and w are the source and the destination node respectively. This routing algorithm can be implemented both in lumped form (namely, the source node computes the whole path from v to w) and in distributed form. In the second case, each node computes only one step of the routing, namely the step required to go one hop ahead towards the destination. These concepts are detailed in the following paragraphs.
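The difference between the lumped and the distributed form can be illustrated with a small behavioural model. This sketch does not implement Algorithm 1: a breadth-first search is used as a stand-in for the shortest-path computation, and the arc rule w = (−D · v − r − 1) mod P is an assumption (the usual Imase-Itoh form of a generalized Kautz graph), since equation (2) is not reproduced here. All function names are hypothetical.

```python
from collections import deque


def gk_successors(v, D, P):
    """Assumed generalized Kautz (Imase-Itoh) arcs: w = (-D*v - r - 1) mod P."""
    return [(-D * v - r - 1) % P for r in range(D)]


def next_hop(cur, dst, D, P):
    """Distributed form: the next node on a shortest path from cur to dst.

    Only (cur, dst) is used; a breadth-first search stands in for Algorithm 1.
    """
    first_hop = {cur: None}
    queue = deque([cur])
    while queue:
        u = queue.popleft()
        for w in gk_successors(u, D, P):
            if w in first_hop:
                continue
            # Remember which neighbour of cur the search went through.
            first_hop[w] = w if u == cur else first_hop[u]
            if w == dst:
                return first_hop[w]
            queue.append(w)
    raise ValueError("destination not reachable")


def lumped_path(src, dst, D, P):
    """Lumped form: the source works out every hop before sending."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst, D, P))
    return path


if __name__ == "__main__":
    print(lumped_path(0, 17, 4, 30))
```

Because each call to `next_hop` depends only on the current node and the destination, chaining the per-node decisions reproduces the path that the source would compute on its own, which is the property the distributed implementation relies on.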
Digital circuit for distributed optimal shortest path routing

The shortest path from node v to node w can be computed exploiting the recurrence in (5). As can be observed, this is a first-order recurrence, so there is no need to keep track of the previous steps. Thus, when a packet arrives at node y_n, only w and y_n are required to compute y_{n−1}; that is, the only information required to compute y_{n−1} is i) the current node y_n and ii) the destination node w. This property allows Algorithm 1 to be reused by simply replacing v with y_n. Moreover, the for loop at line 14 in Algorithm 1 reduces to the computation of t_{z−1} only. Another simplification is obtained by rewriting (5) as
where χ_n = [D · (P − 1 − y_n)] mod P is only a function of y_n. As a consequence, for each node we can precalculate χ_n and store it in a register. Then, once t_{n−1} is computed, y_{n−1} is calculated via a modulo-P adder (see the right part of the dark gray shaded box in Fig. 2). Besides, t_{n−1} depends on w and y_n (Algorithm 1), so it can be implemented with some constants and a small amount of logic, as detailed in the following paragraphs (see the light gray shaded box in Fig. 2). The circuit that implements the shortest-path routing relies on two steps:
i) finding z and g, and ii) representing g as a D-ary array, whose elements are g_n, and implementing (6). The first step is implemented by observing that the while loop at line 6 in Algorithm 1 is devoted to finding both z, i.e. the length of the shortest path from node v to node w, and g, required to compute the tag t. The same holds true if we substitute v with y_n. Since z ≤ m = ⌈log_D(P)⌉, the search for z can be performed in parallel. Thus, the proposed circuit identifies m candidates for g, where the i-th candidate g^{(i)}, according to lines 8 and 10 of Algorithm 1, is
where v has been replaced with y n . Each of the candidates is compared to a threshold (D i ) and a priority encoder selects g = g
can be inferred from (7), we can compute g (i) via a modulo P adder where the first operand is w and the second one is
which can be precalculated and stored in a register inside node y n . As shown in the light gray shaded box of Fig. 2 , each mod P adder relies on a subtracter and a multiplexer driven by the sign of w + k i − P : if the sign is positive then
The second step, depicted in the dark gray shaded box in Fig. 2 , relies on representing g as an array of at most m D-ary elements, where the element at position h = z − 1, that is g z−1 , is used to compute t z−1 as in lines 16 and 18 of Algorithm 1. Then, h 0 , which is the least significant bit of h, is used to select
Finally, with a modulo P operation we obtain
As can be inferred from Algorithm 1 and Fig. 2, a significant complexity reduction can be achieved if both D and P are powers of two. Indeed, in this case both the generation of D-ary elements and modulo-P operations are implemented with no hardware cost by exploiting the binary representation. An example of the architecture achieved for D = 4 and P = 64 is shown in Fig. 3, where i) both comparators and priority encoder are implemented with a few logic gates and ii) modulo-P operations are obtained by letting the data wrap (when an overflow occurs). However, when both D and P are powers of two, (4) cannot be satisfied and some self-loops have to be tolerated. The impact of this choice will be discussed in Section 5, where experimental results are shown.
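The trade-off discussed above can be reproduced with a small behavioural model. The sketch below is not the circuit of Fig. 2 or Fig. 3: the function names are hypothetical, and the self-loop check assumes the usual Imase-Itoh arc rule w = (−D · v − r − 1) mod P for generalized Kautz graphs, since equation (2) is not reproduced here.

```python
def mod_add_sub_mux(a, b, P):
    """Modulo-P addition for 0 <= a, b < P with a subtracter and a multiplexer."""
    total = a + b
    reduced = total - P                          # subtracter
    return reduced if reduced >= 0 else total    # multiplexer driven by the sign


def mod_add_wrap(a, b, bits):
    """Modulo-2**bits addition: the carry out is simply dropped."""
    return (a + b) & ((1 << bits) - 1)


def digits_pow2(g, digit_bits, count):
    """D-ary digits of g (least significant first) when D = 2**digit_bits."""
    mask = (1 << digit_bits) - 1
    return [(g >> (digit_bits * n)) & mask for n in range(count)]


def gk_self_loops(D, P):
    """Nodes with a self-loop under the assumed rule w = (-D*v - r - 1) mod P."""
    return [v for v in range(P)
            if any((-D * v - r - 1) % P == v for r in range(D))]


if __name__ == "__main__":
    print(mod_add_sub_mux(20, 15, 30), mod_add_wrap(20, 15, 5))
    print(digits_pow2(27, 2, 3))       # base-4 digits obtained by bit slicing
    print(gk_self_loops(4, 32), gk_self_loops(4, 30))
```

Under the assumed arc rule, D = 4 and P = 32 give four nodes with a self-loop, which is consistent with the four nodes having three connections reported in Section 5, whereas D = 4 and P = 30, where D + 1 divides P, gives none.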
Case of study: intra-IP NoC for a turbo/LDPC decoder architecture
Turbo and LDPC codes are characterized by almost uniform traffic patterns both in time and space [44]. As a consequence, regular logarithmic-diameter topologies, such as generalized Kautz topologies, that are able to minimize the distance between two nodes are an interesting option [19]. Besides, NoC-based turbo and LDPC code decoder architectures are characterized by deterministic traffic, imposed by the structure of the interleaver and of the parity check matrix respectively. The interleaving operation in turbo codes scrambles the processing order of data within a block and relies on a permutation with almost uniform distribution [44]. The parity check matrix of LDPC codes is a very sparse binary matrix and the positions of the '1's define the mapping pattern among the PEs. These characteristics can be exploited by precalculating all the routing and scheduling information by means of a cycle-accurate simulator, such as [45], which performs RTL simulations exploiting the SystemC simulation kernel. The simulator, which has been modified to support both turbo and LDPC codes, requires the following information: i) the topology description in the form of an adjacency matrix, ii) the traffic pattern (permutation law for turbo codes or parity check matrix for LDPC codes), iii) the routing algorithm routing element and stored in memories. Unfortunately, as shown in [46], the amount of memory required to implement such an approach is very large, being almost impractical in some real cases, such as for the HSDPA standard. As a consequence, routing and scheduling information has to be computed algorithmically on-the-fly, minimizing complexity and delay. Even if routing tables can be used for this class of applications, the proposed low complexity circuit is a scalable alternative.
The node architecture proposed in [18] relies on a PE, which performs the data computation, and a Routing Element (RE), which sends and receives data to/from the network. Each RE is made of a (D + 1) × (D + 1) crossbar, D + 1 input FIFOs and output registers, which are managed by an RA, and a scheduler, as depicted in Fig. 4. As discussed in [19], shortest-path RAs coupled with Round-Robin (RR) and Longest-Queue-First (LQF) scheduling policies are the most suited solutions for channel code decoding applications.
In this work the intra-IP NoC-based multi-mode turbo/LDPC decoder architecture proposed in [18] is taken as a case study. In particular, this architecture i) supports all WiMAX LDPC codes, which in the worst case imply a frame size N = 2304 and a code rate r = 0.5, and ii) relies on a generalized Kautz topology with D = 3 and P = 22 nodes, where the shortest-path RA is distributed and performed via routing tables. It is worth noting that the frame size N corresponds to the number of packets sent over the network during an iteration. In the following, the architecture presented in [18] is extended to the case D = 4, P = 32 and it will be referred to as Routing-Table-based NoC (RT-NoC) architecture. The RT-NoC architecture is then modified replacing the routing tables with the circuit-based RA described in Section 4.2. This new architecture will be referred to as RA-NoC. Both architectures have been place and route layout of the proposed intra-IP RA-NoC for D = 4, P = 32 is shown in Fig. 5. Stemming from the design in [18], the FIFOs in each node have been conservatively sized to eight locations to prevent slowing the PEs. The structure of the packet is tailored around the application and relies on a header, containing the destination node (⌈log_2 P⌉ = 5 bits), and a payload, containing one datum (6 bits) and the memory location where the datum will be stored at the destination node (⌈log_2(N · r/P)⌉ = 6 bits). Thus, the data width of the NoC is 17 bits and, since each packet is made of one flit only, the system is deadlock-free.
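The packet width quoted above follows from three field sizes, which can be recomputed as a quick sanity check. The helper names below are hypothetical; the numbers (P = 32, N = 2304, r = 0.5, 6-bit datum) are the ones given in the text.

```python
from math import ceil


def ceil_log2(x):
    """Number of bits needed to index x distinct values (x >= 1)."""
    return (x - 1).bit_length()


def packet_fields(P, N, rate, datum_bits):
    """Field widths of the single-flit packet described in the text."""
    header = ceil_log2(P)                 # destination node
    locations = ceil(N * rate / P)        # data stored per node
    address = ceil_log2(locations)        # memory location at the destination
    return {"header": header, "datum": datum_bits, "address": address,
            "total": header + datum_bits + address}


if __name__ == "__main__":
    print(packet_fields(P=32, N=2304, rate=0.5, datum_bits=6))
```

For the D = 4, P = 32 case this gives a 5-bit header, a 6-bit datum and a 6-bit address, i.e. the 17-bit data width stated above.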
Post place and route area and switching-activity-based power consumption are summarized for both architectures in Table 2. As can be observed, the proposed RA-NoC features an area and a power consumption reduction of about 14% and 10% with respect to the RT-NoC.
As discussed in Section 4.2, choosing D and P as powers of two leads to a simplification in the RA circuit, but introduces self-loops. Since self-loops are not used, they have been removed in the implemented architecture. This simplification can reduce the decoder throughput as some nodes have fewer connections, namely with D = 4 and P = 32 there are twenty-eight nodes with 4 connections and four nodes with 3 connections. To explore this direction we observed that, if D is not a power of two, then converting g to an array of D-ary elements becomes complex. Moreover, if P is not a power of two, modulo operations can still be performed with limited complexity, since they require only a subtracter and a multiplexer (see Fig. 2). Thus, we fixed D = 4 and P = 30, so that (4) is satisfied and self-loops are avoided, leading to a topology where all nodes have 4 connections, and implemented the corresponding NoC in the same conditions used for the case D = 4, P = 32. As can be inferred from Table 3, a fair comparison is not straightforward. Indeed, experimental results in Table 3 ,
In [19] it is shown that logarithmic-diameter topologies achieve higher throughput than the well-known mesh topologies. However, this result was obtained using the same clock frequency for both topologies. It is widely recognized that mesh topologies, being highly regular, can benefit from short and uniform interconnect delays and so they can reach higher clock frequencies than other topologies, such as Kautz ones. As a consequence, even if mesh topologies have mesh topology can run at a higher clock frequency than the generalized Kautz one.
On the contrary, the generalized Kautz topology requires fewer clock cycles than the toroidal mesh topology for the decoding of the WiMAX LDPC code with frame size N = 2304 and code rate r = 0.5. As can be inferred from Table 4, the times to complete the message exchange phase for the WiMAX LDPC code N = 2304, r = 0.5 with the generalized Kautz topology and the toroidal mesh topology are 9.7 µs and 11 µs respectively. This last result confirms the effectiveness of logarithmic-diameter topologies, such as the generalized Kautz ones, for low latency applications, such as iterative channel code decoder architectures.
Finally, in Fig. 6 the area and the power consumption of one routing-table-based RE are compared with the area and the power consumption of one circuit-based RE as a function of P in the same test conditions employed for the full decoders. As can be observed, as P increases the advantage of the circuit-based RE becomes larger, e.g. when P = 64 the area and power reductions are about 20% and 14% with respect to one table-based RE. It is worth noting that, with a target clock frequency of 200 MHz, the propagation delay of the routing decision calculation in one RE is 3.3 ns. However, the circuit can be pushed to run up to 950 MHz, at the cost of increasing the area by 2.5 times.
Conclusion
In this work a circuit to implement the optimal shortest-path RA proposed in [30] for generalized Kautz topologies is presented. The proposed circuit features lower complexity and power consumption than routing-table-based shortest-path RAs. Experimental results on the NoC-based design proposed in [18] show that the complexity of the NoC is reduced by about 14% and the power consumption by about 10%. Compared with the well-known toroidal mesh topology, the proposed solution achieves higher throughput, making it an interesting approach for high throughput and low latency applications.
