Improving Network-on-Chip-based Turbo Decoder Architectures by Martina, Maurizio & Masera, Guido
Politecnico di Torino
Porto Institutional Repository
[Article] Improving Network-on-Chip-based Turbo Decoder Architectures
Original Citation:
M. Martina; G. Masera (2013). Improving Network-on-Chip-based Turbo Decoder Architectures.
In: JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL, IMAGE, AND VIDEO
TECHNOLOGY, vol. 73 n. 1, pp. 83-100. - ISSN 1939-8018
Availability:
This version is available at : http://porto.polito.it/2511690/ since: August 2013
Publisher:
Springer
Published version:
DOI:10.1007/s11265-013-0733-7
Terms of use:
This article is made available under terms and conditions applicable to Open Access Policy Article
("Public - All rights reserved") , as described at http://porto.polito.it/terms_and_conditions.
html
Journal of Signal Processing Systems
for Signal, Image, and Video Technology
(formerly the Journal of VLSI Signal
Processing Systems for Signal, Image,
and Video Technology)
 
ISSN 1939-8018
Volume 73
Number 1
 
J Sign Process Syst (2013) 73:83-100
DOI 10.1007/s11265-013-0733-7
Improving Network-on-Chip-based Turbo
Decoder Architectures
Maurizio Martina & Guido Masera
Improving Network-on-Chip-based Turbo Decoder
Architectures
Maurizio Martina · Guido Masera
Received: 9 February 2012 / Accepted: 1 February 2013 / Published online: 19 March 2013
© Springer Science+Business Media New York 2013
Abstract In this work novel results concerning Network-on-Chip-based turbo decoder architectures are presented. Stemming from previous publications, this work concentrates first on improving the throughput by exploiting adaptive-bandwidth-reduction techniques. In the best case, these techniques improve the throughput by more than 60 Mb/s. Moreover, it is known that double-binary turbo decoders require higher area than binary ones. This characteristic has the negative effect of increasing the data width of the network nodes. Thus, the second contribution of this work is to reduce the network complexity to support double-binary codes, by exploiting bit-level and pseudo-floating-point representation of the extrinsic information. In the best case, these two techniques allow for an area reduction of more than 40 % with a performance degradation of about 0.2 dB.
Keywords NoC · Turbo decoder · VLSI
1 Introduction
Today, modern telecommunications are a pervasive experience of data exchange among users and devices. One critical aspect of this scenario is the continuous demand for higher data rates, a problem that is exacerbated by the need for reliable transmission of data. To that purpose, the push on the so-called beyond-3G technologies, such as WiMAX [1] and 3GPP-LTE [2], is a possible answer, where the reliability is obtained by exploiting effective error correcting codes, such as turbo [3] and Low-Density-Parity-Check (LDPC) [4] codes. Unfortunately, the decoding algorithms for these codes are iterative, making high throughput implementations a challenging task [5, 6].

M. Martina · G. Masera
Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Torino, Italy
e-mail: maurizio.martina@polito.it
G. Masera
e-mail: guido.masera@polito.it

As shown in Table 1 in [7], several modern standards for communications use turbo codes as a reliable channel coding scheme. However, since these codes have limited similarities, flexible architectures able to support different standards are interesting solutions to achieve interoperability [8]. This direction has been investigated in several works, such as [9–19], where not only flexibility but also high throughput, achieved by means of parallel architectures, is addressed. As an example, [11, 13, 16, 20–22] deal with optimized Application-Specific-Integrated-Circuit (ASIC) architectures where the flexibility is limited to two standards: UMTS/WiMAX [11], 3GPP-LTE/WiMAX [13, 21, 22], W-CDMA/CDMA2000 [20] and 3GPP-LTE/HSDPA [16]. On the other hand, [9, 10, 14, 15, 23] are based on the Application-Specific-Instruction-set-Processor (ASIP) approach, where optimized processor-like architectures are designed with Coware Processor Designer based on the LISA description language [24]. It is worth observing that ASIP-based solutions allow for greater flexibility than ASIC-based architectures, as they can support several different codes and standards. In this direction mixed Turbo/LDPC decoder architectures have been proposed either as ASIC [17, 18] or ASIP [19] architectures. However, mixed code decoder architectures are not discussed in this work. Moreover, as suggested in [14], ASIP solutions are well suited to implement high throughput multiprocessor turbo decoder architectures [7].
Several works, including [7, 25–29] address the problem
of communication flexibility in multistandard turbo decoder
architectures. Recently, in [30] we introduced the concept
of intra-IP Network-on-Chip (NoC), where the well known
NoC paradigm [31–34] is applied to the communication
structure of processing elements that belong to the same
IP [35]. Intra-IP NoC [7, 25–28] is a flexible solution to
enable multi-ASIP turbo decoder architectures. However, as
shown in detail in [7, 28], flexibility comes at the expense
of increasing the complexity of the decoder architecture. In
this work we improve the complexity/performance trade-off
of NoC-based turbo decoder architectures by reducing the
traffic load on the network as suggested in [36]. In [36] tech-
niques to reduce the traffic load are proposed and their effect
in terms of bandwidth reduction is analyzed. However, the
resulting improvement of throughput is not investigated. In
this work the actual effect of these techniques on a complete
NoC-based turbo decoder architecture is studied for the
first time. The adopted technique of traffic reduction offers
in the best case a throughput improvement of more than
60 Mb/s and 40 Mb/s for binary and double-binary codes
respectively. To the best of our knowledge this is the first
work showing the benefit of traffic reduction techniques in
an NoC-based turbo decoder architecture. Furthermore, we
exploit two known techniques [37, 38], originally proposed
to limit the amount of memory in turbo decoder architec-
tures, as possible solutions to reduce the complexity of the
NoC when double-binary turbo codes [39] are employed, as
in the WiMAX standard. To the best of our knowledge this
is the first work where the techniques proposed in [36–38]
are jointly used to design an NoC-based turbo decoder
architecture.
The paper is structured as follows: in Section 2 we recall
the equations required to implement the decoding algorithm,
whereas in Section 3 we describe the peculiar characteris-
tics of an NoC-based turbo decoder architecture, including
the architecture of routing elements, low-complexity routing
algorithms and topologies. Section 4 describes the exper-
imental setup we defined to increase the throughput and
reduce the area of NoC-based turbo decoder architectures
both in the case of binary and double-binary codes. To this
purpose we considered the HSDPA and the 3GPP-LTE stan-
dards for the case of binary codes, and the WiMAX standard
for the case of double-binary codes. Finally, in Section 5
conclusions are drawn.
2 Decoding Algorithms
Turbo code decoding algorithms are briefly reviewed in
the following. For the sake of brevity, this description does
not provide details that are not strictly required for the
understanding of the NoC based decoding architectures con-
sidered in the following sections of the paper. For a deeper
knowledge of decoding algorithms used for turbo codes the
reader can refer to [5].
Turbo codes are based on the concatenation (usually parallel) of two constituent Convolutional Codes (CC) (Fig. 1a), where CC1 processes the uncoded sequence $u$ in natural order ($u_1$), whereas CC2 encodes the uncoded sequence in scrambled order ($u_2$) according to the interleaver $\Pi$. On the other hand, the decoder is made of two constituent decoders that exchange their data by means of an interleaver ($\Pi$) and a deinterleaver ($\Pi^{-1}$), see Fig. 1b. For the sake of brevity, in the next paragraph we define the symbols used in Fig. 1a and b without specifying whether they are related to CC1 or CC2.
The decoding algorithm of turbo codes is an iterative process made of two half iterations, one for each constituent decoder. Each half iteration is based on Maximum-A-Posteriori (MAP) estimation achieved by means of the BCJR algorithm [40], where the Log-Likelihood-Ratio (LLR) representation is usually adopted [41]. During each half iteration one constituent decoder reads from a memory the intrinsic information $\lambda_k[c]$ received from the channel and the a-priori information $\lambda_k^{\mathrm{apr}}[u]$ to produce extrinsic information $\lambda_k^{\mathrm{ext}}[u]$. Extrinsic information computed in one half iteration becomes the a-priori information for the next half iteration. Based on the trellis notation shown in Fig. 1c, and denoting by $U$ the set of uncoded symbols, each constituent MAP decoder, often referred to as Soft-In-Soft-Out (SISO) module, computes

$$\lambda_k^{\mathrm{ext}}[u] = \underset{e:u(e)=u}{\overset{*}{\max}}\{b^{\mathrm{ext}}(e)\} - \underset{e:u(e)=\tilde{u}}{\overset{*}{\max}}\{b^{\mathrm{ext}}(e)\} - \lambda_k^{\mathrm{apr}}[u] \tag{1}$$

Figure 1 Parallel concatenation of two convolutional codes: encoder (a), decoder (b), notation for a trellis section (c).

where $\tilde{u} \in U$ is an uncoded symbol taken as a reference (usually $\tilde{u} = 0$), $u \in U \setminus \{\tilde{u}\}$, $k$ is a trellis step, $e$ is a transition in a trellis step and $u(e)$ is the corresponding uncoded symbol. Thus, $\lambda_k^{\mathrm{ext}}[u]$ and $\lambda_k^{\mathrm{apr}}[u]$ are the extrinsic and a-priori information respectively for symbol $u$ at trellis step $k$, expressed as LLRs. The $\overset{*}{\max}\{x_i\}$ function is usually implemented as a tree structure where each node of the tree is a 2-input $\overset{*}{\max}$ function:

$$\overset{*}{\max}\{x_1, x_2\} = \max\{x_1, x_2\} + f_c(|x_1 - x_2|). \tag{2}$$

As can be observed, the $f_c(\cdot)$ term in Eq. 2 is a correction term that is often precalculated and stored in a small Look-Up-Table (LUT) [42, 43]. The correction term, usually adopted when decoding binary codes (Log-MAP), can be omitted for double-binary turbo codes with minor error rate performance degradation (Max-Log-MAP). In other words, each $\overset{*}{\max}\{\cdot\}$ function in Eq. 1 selects at each trellis step the maximum¹ metric $b^{\mathrm{ext}}(e)$ among the ones whose uncoded symbol is $u$ or $\tilde{u}$ respectively.
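
As a minimal sketch of Eq. 2, the 2-input function and its n-input extension can be written as below. The closed form $f_c(x) = \ln(1 + e^{-x})$ is the standard Jacobian-logarithm correction and is an assumption here, since the text only says that $f_c$ is stored in a LUT; the function names are illustrative. The n-input version folds the values through a chain of 2-input nodes rather than a balanced tree, which gives the same value up to rounding.

```python
import math
from functools import reduce


def f_c(d):
    """Correction term, assumed to be the Jacobian logarithm ln(1 + exp(-d)), d >= 0."""
    return math.log1p(math.exp(-d))


def max_star2(x1, x2, log_map=True):
    """2-input max*: Eq. 2 when log_map is True, plain max (Max-Log-MAP) otherwise."""
    m = max(x1, x2)
    return m + f_c(abs(x1 - x2)) if log_map else m


def max_star(values, log_map=True):
    """n-input max* obtained by chaining 2-input nodes over finite inputs."""
    return reduce(lambda a, b: max_star2(a, b, log_map), values)
```

With `log_map=False` the correction is dropped and the result is the true maximum, which is the Max-Log-MAP case mentioned above.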
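The term $b^{\mathrm{ext}}(e)$ in Eq. 1 is defined as:

$$b^{\mathrm{ext}}(e) = \alpha_{k-1}[s^S(e)] + \gamma_k^{\mathrm{ext}}[e] + \beta_k[s^E(e)] \tag{3}$$

$$\alpha_k[s] = \underset{e:s^E(e)=s}{\overset{*}{\max}}\left\{\alpha_{k-1}[s^S(e)] + \gamma_k[e]\right\} \tag{4}$$

$$\beta_k[s] = \underset{e:s^S(e)=s}{\overset{*}{\max}}\left\{\beta_{k+1}[s^E(e)] + \gamma_k[e]\right\} \tag{5}$$

$$\gamma_k[e] = \lambda_k[u(e)] + \lambda_k[c^u(e)] + \lambda_k[c^p(e)] \tag{6}$$

$$\gamma_k^{\mathrm{ext}}[e] = \lambda_k[c^p(e)] \tag{7}$$
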
where $s^S(e)$ and $s^E(e)$ are the starting and the ending states of $e$, and $\alpha_k[s^S(e)]$ and $\beta_k[s^E(e)]$ are the forward and backward metrics associated with $s^S(e)$ and $s^E(e)$ respectively. In other words, each $\alpha$ ($\beta$) metric is the $\overset{*}{\max}$ value among the path metrics whose transition ends in (starts from) state $s$. The terms $\lambda_k[u(e)]$, $\lambda_k[c^u(e)]$ and $\lambda_k[c^p(e)]$ are obtained by adding the corresponding a-priori, intrinsic systematic and intrinsic parity LLRs respectively.
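
The forward recursion of Eq. 4 can be sketched as follows for a time-invariant trellis. This is an illustration under stated assumptions, not the architecture of the paper: the trellis is given as a hypothetical list of (start state, end state) edges, the decoder is assumed to start in state 0, every state is assumed to have at least one incoming edge, and max* is evaluated as a numerically stable log-sum-exp.

```python
import math

NEG_INF = float("-inf")


def max_star(values):
    """n-input max* (Log-MAP), evaluated as a stable log-sum-exp."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_metrics(num_states, edges, gammas, combine=max_star):
    """Eq. 4: alpha[k][s] combines alpha[k-1][s_start] + gamma over edges ending in s.

    edges  : list of (start_state, end_state) pairs, identical at every step
    gammas : gammas[k][e] is the branch metric of edge e at trellis step k
    combine: max_star for Log-MAP, built-in max for Max-Log-MAP
    """
    # Assumption: the trellis starts in state 0, so other states get -inf.
    alpha = [[0.0 if s == 0 else NEG_INF for s in range(num_states)]]
    for gamma_k in gammas:
        prev = alpha[-1]
        step = []
        for s in range(num_states):
            terms = [prev[s0] + gamma_k[e]
                     for e, (s0, s1) in enumerate(edges) if s1 == s]
            step.append(combine(terms))
        alpha.append(step)
    return alpha
```

The backward recursion of Eq. 5 has the same shape, scanning the steps in reverse and grouping edges by their starting state.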
In a parallel decoder, $P$ SISOs operate concurrently on disjoint portions of the trellis. Denoting by $N$ the number of trellis steps processed by each constituent decoder, each SISO operates on a trellis slice made of $N/P$ steps. As a consequence, we can extend the notation introduced in the previous paragraph to a parallel decoder, where $\lambda_{i,j}^{\mathrm{ext}}[u]$ is the extrinsic information produced by SISO $i$ at the $j$-th trellis step. Moreover, we observe that $N$ is also the number of elements exchanged by means of $\Pi$ and $\Pi^{-1}$. For further details on the decoding algorithm the reader can refer to [5].
¹ If Log-MAP algorithms are used, the result is not actually the maximum among the input values, due to $f_c(x)$.
3 NoC-based Turbo Decoder Architectures
An NoC-based turbo decoder architecture can be represented as a graph of $P$ nodes where each node is made of a Routing Element (RE) and a Processing Element (PE) (see Fig. 2). Each PE, devoted to performing the processing required by the BCJR algorithm, i.e. Eqs. 1–7, contains a SISO processor and two memories where intrinsic and a-priori information are stored respectively. On the other hand, each RE has a simple structure made of $M$ input buffers (FIFOs), an $M \times M$ crossbar switch and $M$ output registers. REs are devoted to routing the data produced by PEs to the correct destination node according to $\Pi$ and $\Pi^{-1}$. To this purpose we introduce $d(i,j)$ as the destination node, or destination identifier (ID), of $\lambda_{i,j}^{\mathrm{ext}}[u]$. In order to complete a half iteration, $\lambda_{i,j}^{\mathrm{ext}}[u]$ is stored at the location $t(i,j)$ in the a-priori information memory of node $d(i,j)$. Since the network does not guarantee the delivery order of the extrinsic information, in Fig. 2 we refer to extrinsic information coming from the network as $\hat{\lambda}_{i,j}^{\mathrm{ext}}[u]$. Similarly, we refer to the memory location where $\hat{\lambda}_{i,j}^{\mathrm{ext}}[u]$ is stored as $\hat{t}(i,j)$. It is worth observing that $d(i,j)$ and $t(i,j)$ are uniquely determined by the interleaver (deinterleaver) sequences, namely given $\Pi$, $\Pi^{-1}$ and $P$ we have

$$d(i,j) = \left\lfloor \frac{P}{N} \cdot \tilde{\Pi}\left(i \cdot \frac{N}{P} + j\right) \right\rfloor \tag{8}$$

$$t(i,j) = \tilde{\Pi}\left(i \cdot \frac{N}{P} + j\right) \bmod \frac{N}{P} \tag{9}$$

where $\lfloor \cdot \rfloor$ is the next lowest integer value and $\tilde{\Pi}(\cdot)$ can be either $\Pi(\cdot)$ or $\Pi^{-1}(\cdot)$ depending on the current half iteration.
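
Eqs. 8 and 9 amount to splitting the interleaved index into a node number and an offset inside that node's memory. A minimal sketch, assuming the permutation is available as a Python list and that $N$ is a multiple of $P$ (the function name and the list representation are illustrative):

```python
def destination_and_location(perm, P, i, j):
    """Return (d, t) of Eqs. 8-9 for the j-th value produced by SISO i.

    perm : list of length N; perm[n] is the interleaved (or deinterleaved)
           position of index n, depending on the current half iteration
    P    : number of nodes; N is assumed to be a multiple of P
    """
    N = len(perm)
    slice_len = N // P                  # trellis steps handled by each SISO
    target = perm[i * slice_len + j]    # position after (de)interleaving
    d = target // slice_len             # Eq. 8: floor(P/N * target)
    t = target % slice_len              # Eq. 9: target mod N/P
    return d, t
```

Because the permutation is a bijection, the pairs (d, t) generated over all (i, j) are distinct, so no two values compete for the same memory location.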
In general, PEs and REs can operate at different rates; thus, to decouple the design of PEs and REs we define $R$ as the number of packets injected in the network in a clock cycle. As a consequence, $R = 1$ means that each PE injects in the network one new packet per clock cycle, whereas $R = 0.5$ means that a new packet is injected in the network every two clock cycles. It is worth noting that the case $R = 1$ corresponds to REs and PEs working at the same clock frequency (isochronous), with PEs able to output a new packet of extrinsic information at each clock cycle. On the contrary, $R < 1$ models either an isochronous system where PEs output less than one packet per clock cycle or a mesochronous system where REs work at a higher clock frequency than PEs.
3.1 RE Architectures
In [28] three possible architectures for REs (see Fig. 2), referred to as Fully-Adaptive (FA), All Precalculated (AP) and Partially Precalculated (PP) architectures, were presented. All these architectures are intended to support distributed routing, namely each RE computes only the next link on which to send the packet towards the destination ID. Several works in the literature proposed efficient algorithmic implementations of distributed routing. As an example, in [44, 45] Logic-Based-Distributed-Routing (LBDR) and some improved versions are studied, whereas in [46] a general methodology for generating deadlock-free routing algorithms in irregular topologies is presented. On the other hand, since this work aims to improve the application specific system proposed in [28], the routing algorithm is based on forwarding tables for fair comparison.

Figure 2 Node block scheme: a FA architecture, b AP architecture, c PP architecture.

The FA architecture (Fig. 2a) sends on the network packets of data made of a header, containing $d(i,j)$, and a payload containing $\lambda_{i,j}^{\mathrm{ext}}[u]$ and $t(i,j)$. The data are routed by means of a Routing Algorithm (RA) based on forwarding tables. The AP architecture (Fig. 2b) stems from Eqs. 8 and 9, observing that for each node we can precalculate and store, in a Routing Memory (RM) and in a Location Memory respectively, the routing information (i.e. read enable signals for FIFOs, crossbar configuration and load signals for registers) and $\hat{t}(i,j)$, the location where the received value $\hat{\lambda}_{i,j}^{\mathrm{ext}}[u]$ will be stored. Thus, with the AP architecture we reduce the width of the data bus at the expense of some extra memory. In the PP architecture (Fig. 2c) only the $\hat{t}(i,j)$ sequences are precalculated. Thus, PP requires a narrower data width than the FA architecture, but less memory than the AP one.
To improve the throughput/area figures of NoC-based
turbo decoder architecture we infer from [28] two main
results:
• The AP architecture can be conveniently used with
complex routing algorithms to concurrently maximize
the throughput and minimize the area. Indeed, all read
enable, crossbar configuration and load signals are
computed off-line and stored in the RM. Unfortunately,
as pointed out in [7] this comes at the expense of a sig-
nificant amount of external memory to store the routing
information; as an example to support all the inter-
leavers specified by the HSDPA standard [47] about
64 MB of memory are required.
• As long as the network is faster than the PEs (R < 1),
throughput and area figures tend to be independent of
the routing algorithm.
Thus, both FA and PP architectures with simple RAs, based on forwarding tables, should be further investigated. In particular, the performance of the FA architecture can be improved by using Adaptive Bandwidth Reduction (ABR) techniques such as the one proposed in [36], namely avoiding the exchange of unnecessary extrinsic information values. This distinguishing feature of the FA architecture, which is not available with AP and PP architectures, is detailed in Section 4.1. On the contrary, the PP architecture features a narrower data bus than the FA one; however, it requires some external memory to store the configurations of all the Location Memories. Moreover, in several standards, such as HSDPA, 3GPP-LTE and WiMAX, the $d(i,j)$ and $t(i,j)$ sequences can be generated algorithmically with simple architectures [13, 48, 49]. As a consequence, the FA architecture can also take advantage of this feature to reduce the complexity of the whole decoder.
3.2 Low Complexity RAs
In order to increase the throughput and reduce the area of the decoder, RAs should be based on simple, deadlock-free and livelock-free routing policies that can be implemented with little logic and completed in one clock cycle. Simple routing architectures such as LBDR and its improved versions [44, 45] are interesting solutions to reduce the complexity of routing elements. However, they are designed to work with topologies derived from the 2D-mesh. Unfortunately, this is not the case for the topologies studied in this work and in the previous works we compare with, e.g. [7, 28]. In particular, in [28] it is shown that logarithmic topologies, e.g. De-Bruijn and Kautz topologies, are the most suited ones for NoC-based turbo decoder architectures as they reduce the latency overhead added by the interconnection structure with lower complexity than mesh-based topologies. As a consequence, in the following we will focus on distributed shortest-path routing based on forwarding tables for Kautz topologies. It is worth pointing out that, since the traffic pattern in turbo decoder architectures is imposed by the interleaver, buffers can be sized off-line through simulations [50]. As a consequence, a buffer or a link is always available, which avoids deadlock. Moreover, the use of shortest-path routing prevents livelocks.
3.2.1 RAs Description
In a $P$-node network, each node contains one or more tables with $P$ entries. The value of the $i$-th entry is the output port connected to a node that is on one shortest path from the current node to node $i$. As a consequence, the value of each entry depends on both the ID of the current node and the destination ID of the current packet. The value for each entry is obtained off-line by means of the Floyd-Warshall-Algorithm (FWA) [51]. Running the FWA on pruned versions of the network graph until no more local paths exist between one node and the adjacent ones, we obtain the entries of all the tables. As an example, Fig. 3 shows that, in the generalized Kautz topology with $P = 8$ and $D = 4$, two equivalent paths exist to go from node 0 to node 1 (through node 5 or node 7). Let us assume output ports numbered clockwise, that is, for node 0 the link to node 6 is on port 0 and the link to node 7 is on port 3 (the link to the local SISO is always on port $D$). We build up to two forwarding tables for each node. As an example, from Fig. 3 we infer that the second entries (destination node 1) in the forwarding tables of node 0 are 2 and 3 respectively, the output ports connected to nodes 5 and 7. It is worth noting that when a message reaches the destination node it is routed to output port $M − 1$, which is directly connected to the extrinsic information memory (see Fig. 2).

Figure 3 Shortest path routing algorithm example.
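
The off-line table construction can be sketched as below. Two points are assumptions rather than statements of the paper: the adjacency rule is the usual Imase-Itoh definition of generalized Kautz digraphs (node $u$ is linked to $(-D \cdot u - q) \bmod P$ for $q = 1, \dots, D$), which does reproduce the node 0 example above, and the "pruned graph" procedure is replaced by a single Floyd-Warshall run followed by a check of which neighbours lie on a shortest path. Port indices follow the order of $q$ and therefore need not match the clockwise numbering of Fig. 3.

```python
def generalized_kautz(P, D):
    """Successor lists, assuming the Imase-Itoh rule v = (-D*u - q) mod P."""
    return [[(-D * u - q) % P for q in range(1, D + 1)] for u in range(P)]


def floyd_warshall(succ):
    """All-pairs hop distances on the directed graph given by successor lists."""
    P = len(succ)
    inf = float("inf")
    dist = [[0 if u == v else inf for v in range(P)] for u in range(P)]
    for u in range(P):
        for v in succ[u]:
            if v != u:                  # self-loops never shorten a path
                dist[u][v] = 1
    for k in range(P):
        for u in range(P):
            for v in range(P):
                if dist[u][k] + dist[k][v] < dist[u][v]:
                    dist[u][v] = dist[u][k] + dist[k][v]
    return dist


def shortest_path_ports(succ, dist, node, dst):
    """Ports of `node` whose neighbour lies on a shortest path towards `dst`."""
    return [port for port, v in enumerate(succ[node])
            if v != node and 1 + dist[v][dst] == dist[node][dst]]
```

Keeping only the first returned port per destination gives one forwarding table per node; keeping all of them gives the alternative entries used when several equivalent paths exist.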
To limit the complexity of the node, in [28] it is suggested to use only one forwarding table with FA and PP architectures. This solution is referred to as Single-Shortest-Path (SSP). On the other hand, with AP node architectures the complexity of multiple forwarding tables is hidden by the off-line computation of the routing memory content. Thus, in [28] local shortest paths are exploited to spread the traffic over the network. This solution, referred to as All-local-Shortest-Path (ASP), aims at reducing the network congestion. SSP and ASP solutions are coupled with different policies for serving input buffers. In the SSP case simple serving policies, suitable for NoC-based turbo decoder architectures, are used: Round-Robin (RR) and FIFO-length (FL). RR is based on a circular serving policy, whereas with FL each input is served considering the number of elements stored in its input buffer, namely FL sorts the input buffers according to the number of stored elements, then it serves them in decreasing order. In the ASP case an enhanced version of FL, referred to as FL-with-Traffic-spreading (FT), is used instead. The basic idea is to use a counter for each entry of the forwarding tables to obtain accurate information on the use of each path. Then, when two paths are equivalent (i.e. they are both shortest paths) the less used one is selected. Since the ASP-FT technique is rather complex, it is implemented on AP nodes only and off-line simulations are run to obtain the content of the routing memory.
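
The three serving ideas can be illustrated with a few lines of Python. This is only a behavioural sketch: the function names are hypothetical, ties in FL are broken by input index (an assumption, the text does not specify it), and the FT counters are kept per output port in a plain dictionary instead of per forwarding-table entry.

```python
def round_robin_order(start, M):
    """RR: circular serving order of M inputs beginning at input `start`."""
    return [(start + n) % M for n in range(M)]


def fifo_length_order(occupancy):
    """FL: serve inputs by decreasing FIFO occupancy.

    sorted() is stable, so equally full FIFOs keep index order (assumption).
    """
    return sorted(range(len(occupancy)), key=lambda i: -occupancy[i])


def pick_less_used(equivalent_ports, use_count):
    """FT idea: among equivalent shortest-path ports take the least used one."""
    port = min(equivalent_ports, key=lambda p: use_count[p])
    use_count[port] += 1                # remember that this path was used
    return port
```

The lists returned by the first two functions play the role of the serving sequence $S_0, \dots, S_{M-1}$ used by the routing logic.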
3.2.2 RAs Architecture
The input buffer management strategies and the single-shortest-path routing algorithm described in the previous paragraphs have been implemented as in [28]. The RR policy is implemented as a simple circular shift-register that is enabled when there is at least one element in the input FIFOs. On the other hand, FL relies on a small sorting network that, based on the number of elements in the input FIFOs, computes the serving order. At each node the forwarding table is implemented as an M-input-M-output LUT that converts M destinations (dst_i) to the corresponding output ports (dport_i). Then, given that S_0 and S_{M−1} are the first and the last served input FIFOs, M reservation blocks update an M-position binary mask to avoid collisions on output ports. Finally, a priority decoder implements the selected priority and FIFO management policies by properly generating the read enable (ren_i) signals for the FIFOs and the configuration command (adx_i) for the crossbar (see the bottom-right part of Fig. 4). Since the load enable for the registers (le_i) and the adx_i signals must be asserted the clock cycle after ren_i, an early signal for the crossbar configuration (ladx_i) is generated and then delayed by means of registers. In particular, le_i and adx_i are obtained by delaying ren_i and ladx_i by one clock cycle.
Reservation Block According to the order defined by the S_0, ..., S_{M−1} sequence, each reservation block (bottom-left part of Fig. 4) receives the dport_i signals, generates a reservation signal (reserve_i) and specifies the output port to be reserved (port_i). The reservation is obtained by updating the rmask, which contains a '1' in the position of a reserved output port and a '0' in the position of a free output port. Each reservation block generates port_i = dport_{S_i}, which is converted by a one-hot decoder into a mask with a '1' in position port_i. The reservation mask is updated (output rmask) by comparing this mask with the input rmask: if the input rmask contains a '0' in position port_i, then reserve_i goes to '1'.

Priority Decoder The priority decoder is made of two blocks: the read-enable generator (upper-left part of Fig. 4) and the destination-port generator (upper-right part of Fig. 4).

The read-enable generator is based on a few logic gates that implement ren_i = reserve_{S_i} when FIFO i is not empty (FIFOempty_i = '0'). This is obtained by combining the one-hot representation of S_i with the corresponding reserve signal.
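
The combined effect of the reservation blocks and of the read-enable generator can be mimicked in software as follows. This is a behavioural sketch, not the gate-level design: the function name is hypothetical, and it assumes that an empty FIFO never books an output port, a detail the description above leaves open.

```python
def reserve_ports(serving_order, dport, fifo_empty):
    """Grant output ports to input FIFOs in serving order, avoiding collisions.

    serving_order : input indices S_0 ... S_{M-1}
    dport[i]      : output port requested by the head packet of FIFO i
    fifo_empty[i] : True when FIFO i holds no packet
    Returns (ren, rmask): read enables per input, reserved flags per port.
    """
    M = len(dport)
    rmask = [0] * M                     # '1' marks a reserved output port
    ren = [0] * M
    for s in serving_order:
        port = dport[s]
        reserve = rmask[port] == 0      # the port is still free
        # Assumption: only non-empty FIFOs actually take the reservation.
        if reserve and not fifo_empty[s]:
            rmask[port] = 1
            ren[s] = 1
    return ren, rmask
```

Inputs served earlier in the sequence win a contended port, so the serving policy (RR or FL) directly decides which FIFO is read when two packets ask for the same output.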
The destination-port generator is an array of multiplexers, where each multiplexer in position (i, i) implements ladx_i = port_i when ren_i = '1'. On the other hand, ladx_i must take the value of an un-reserved output port when ren_i = '0'. This is obtained by means of the permutation network implemented by the multiplexers in position (j, i) with j ≠ i, whose outputs (mux_{j,i}) are

$$\mathrm{mux}_{j,0} = \begin{cases} 0 & \text{if } \mathrm{port}_0 = j \\ j & \text{otherwise} \end{cases} \tag{10}$$

and for $i > 0$

$$\mathrm{mux}_{j,i} = \begin{cases} \mathrm{mux}_{i,i-1} & \text{if } \mathrm{port}_i = j \\ \mathrm{mux}_{j,k} & \text{otherwise} \end{cases} \tag{11}$$

where

$$k = \begin{cases} i - 2 & \text{if } j = i - 1 \\ i - 1 & \text{otherwise} \end{cases} \tag{12}$$

and if $k < 0$, then $\mathrm{mux}_{j,k} = 0$.
3.3 NoC Topologies
In [28] several fixed degree topologies for NoC-based turbo decoder architectures are considered. However, since $\Pi$ and $\Pi^{-1}$ tend to spread $\lambda_{i,j}^{\mathrm{ext}}[u]$ almost uniformly, the traffic pattern on the network is almost uniform too. Experimental results in [28] show that topologies with logarithmic diameter, such as generalized De-Bruijn [52] and generalized Kautz [53], achieve higher throughput and require lower area than other well known fixed degree topologies such as ring, honeycomb and toroidal-mesh ones. It is worth noting that Kautz topologies (shown in Fig. 5 for $P = 16$) have also been recently suggested as a viable solution to improve performance and latency of NoCs exploiting 3D VLSI design [54, 55].
4 Experimental Setup
Since in this work we aim at increasing the throughput and reducing the area of NoC-based turbo decoder architectures, we focus on the most significant cases discussed in Section 3, namely the FA node architecture with SSP-RR and SSP-FL routing. Moreover, we consider only generalized Kautz topologies, as they have logarithmic diameter and fewer self-loops² than generalized De-Bruijn ones [28, 52, 53]. The degree of the network $D = M − 1$ ranges in {2, 3, 4} and the parameter $R$ varies in {0.33, 0.5, 1}. Then we simulated both HSDPA and 3GPP-LTE interleavers for the case of binary turbo codes. Furthermore, we simulated the double-binary turbo code used in the WiMAX standard as well.

Figure 4 Single-shortest-path routing architecture.

Figure 5 Example of considered Kautz topologies for P = 16 (panels: generalized Kautz D = 2, D = 3 and D = 4).

Figure 6 BER performance of the HSDPA N = 5114 turbo decoder with ABR technique for different values of K (BER versus SNR [dB]; curves: no ABR and K = 4, 6, ..., 28).

² If we model a topology as a graph, a self-loop is an edge whose source and destination nodes coincide.

In the following the throughput is computed as

$$T = \frac{N_b \cdot f_{clk}}{I \cdot \left(N_0^{cyc} + N_1^{cyc}\right)} \tag{13}$$

where $N_b$ is the number of decoded bits, $f_{clk}$ is the clock frequency, $I$ is the number of iterations, and $N_0^{cyc}$ and $N_1^{cyc}$ are the numbers of clock cycles required to complete the interleaved and deinterleaved half iterations respectively. It is worth pointing out that $N_b = N$ for binary codes and $N_b = 2N$ for double-binary codes. Results shown in the following sections have been obtained for $f_{clk} = 200$ MHz and $I = 8$ with the Turbo-NoC simulator [50] and Synopsys Design Compiler for a 130 nm standard cell technology. The complexity characterization shown in this work does not consider post place and route area overhead [56]. However, as pointed out in [57] and [58] for regular topologies, at least at the 130 nm technology node, the area occupation of logic cells in the design gives a useful indication about the actual complexity of the NoC. Moreover, results are available in the literature showing that a low area layout can be generated for logarithmic diameter networks. For example, in [59] and [60] an optimal layout algorithm is run on generalized de-Bruijn topologies, leading to a layout comparable to, and in some cases smaller than, the one required by toroidal meshes. More generally, since the layout area of a network $x$ can be expressed as being on the order of $|B(x)|_{\min}^2$ [61], where $|B(x)|_{\min}$ is the minimum bisection width of the network, this model can be used to roughly compare different topologies. For example, $|B(T)|_{\min} = 2\sqrt{P}$ and $|B(B)|_{\min}$ is on the order of $P / \log P$ for toroidal mesh and for de-Bruijn topologies respectively: this shows that the routing overhead of logarithmic diameter networks is similar to that of toroidal meshes, at least up to 64 nodes.
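
Eq. 13 translates directly into a small helper. The clock frequency and the iteration count in the usage note are the values stated above, whereas the cycle counts are purely hypothetical numbers chosen for illustration, not results from the paper.

```python
def throughput_bps(n_bits, f_clk_hz, iterations, n_cyc0, n_cyc1):
    """Eq. 13: decoded bits per second.

    n_bits     : N for binary codes, 2N for double-binary codes
    f_clk_hz   : clock frequency in Hz
    iterations : number of full iterations I
    n_cyc0/1   : clock cycles of the interleaved / deinterleaved half iteration
    """
    return n_bits * f_clk_hz / (iterations * (n_cyc0 + n_cyc1))
```

For instance, `throughput_bps(5114, 200e6, 8, 350, 350)` evaluates Eq. 13 for an HSDPA-sized frame under the hypothetical assumption of 350 cycles per half iteration; any reduction of the cycle counts, such as the one targeted by traffic reduction, raises the result proportionally.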
4.1 ABR in NoC-based Turbo Decoder Architectures
Since ABR techniques reduce the traffic in the network,
they reduce the latency of the decoder, i.e. $N^{cyc}_0$ and $N^{cyc}_1$
in Eq. 13. As a consequence, they are suited to increasing the
throughput of an NoC-based turbo decoder. This approach
is similar to the well-known early stopping criteria that are
routinely used to both increase the throughput and reduce the
Figure 7 BER performance of the 3GPP-LTE N = 6144 turbo decoder with ABR technique, K = 4, 6, 8, 10. [Plot: BER versus SNR [dB]; curves: no ABR, K = 4, K = 6, K = 8, K = 10.]
power consumption in turbo decoder architectures [62–64].
However, most related works focus on frame-level early
stopping criteria. On the contrary, bit-level/symbol-level
early stopping criteria [65] take into account that the
reliability of each bit/symbol in a frame converges at a different
speed. As a consequence, when the extrinsic information
Figure 8 Average throughput improvement at different SNR values of the HSDPA N = 5114 turbo decoder with K = 4, 6, 8, 10 on generalized Kautz networks D = 2 and P = 64. [Plot: T [Mb/s] versus SNR [dB]; curves: FL and RR, each with no ABR and K = 4, 6, 8, 10.]
Figure 9 Average throughput improvement at different SNR values of the HSDPA N = 5114 turbo decoder with K = 4, 6, 8, 10 on generalized Kautz networks D = 3 and P = 64. [Plot: T [Mb/s] versus SNR [dB]; curves: FL and RR, each with no ABR and K = 4, 6, 8, 10.]
of a certain bit/symbol meets a proper reliability criterion,
it is not necessary to refine it further. From an NoC-based
turbo decoder perspective, this means that reliable $\lambda^{ext}_{i,j}[u]$
are no longer sent over the network.
4.2 HSDPA and 3GPP-LTE Case Study
For binary turbo codes, as the ones employed in HSDPA and
3GPP-LTE standards, a simple ABR technique is obtained
Figure 10 Average throughput improvement at different SNR values of the HSDPA N = 5114 turbo decoder with K = 4, 6, 8, 10 on generalized Kautz networks D = 4 and P = 64. [Plot: T [Mb/s] versus SNR [dB]; curves: FL and RR, each with no ABR and K = 4, 6, 8, 10.]
Table 1 Throughput [Mb/s] - area [mm2] achieved with the HSDPA N = 5114 and LTE N = 6144 interleavers, with generalized Kautz
topologies for P ∈ {8, 16, 32, 64}, R ∈ {0.33, 0.5, 1}, SSP-RR, SSP-FL and ASP-FT routing algorithms, no ABR. Each entry is throughput - area.
R | Routing (node) | HSDPA P = 8 | HSDPA P = 16 | HSDPA P = 32 | HSDPA P = 64 | LTE P = 8 | LTE P = 16 | LTE P = 32 | LTE P = 64
D = 2
1.00 | SSP-RR (FA) | 54 - 3.13 | 74 - 5.29 | 105 - 7.36 | 159 - 9.71 | 53 - 3.69 | 71 - 6.25 | 101 - 8.76 | 140 - 11.14
1.00 | SSP-FL (FA) | 58 - 3.16 | 81 - 5.04 | 117 - 6.65 | 171 - 8.48 | 54 - 3.88 | 75 - 5.93 | 109 - 7.79 | 151 - 9.74
1.00 | ASP-FT (AP) | 58 - 1.53 | 81 - 2.51 | 117 - 3.40 | 171 - 4.51 | 54 - 2.01 | 75 - 3.01 | 109 - 4.00 | 151 - 5.14
0.50 | SSP-RR (FA) | 46 - 0.59 | 72 - 1.71 | 101 - 4.07 | 142 - 6.64 | 44 - 0.64 | 66 - 1.95 | 90 - 4.63 | 120 - 7.44
0.50 | SSP-FL (FA) | 46 - 0.56 | 78 - 1.26 | 112 - 3.39 | 156 - 5.76 | 44 - 0.61 | 72 - 1.37 | 100 - 3.79 | 131 - 6.31
0.50 | ASP-FT (AP) [7] | 46 - 0.62 | 78 - 1.01 | 112 - 2.06 | 156 - 3.40 | 44 - 0.71 | 72 - 1.13 | 100 - 2.34 | 131 - 3.72
0.33 | SSP-RR (FA) | 31 - 0.54 | 58 - 0.86 | 92 - 2.01 | 129 - 4.57 | 29 - 0.59 | 52 - 0.90 | 79 - 2.25 | 102 - 4.84
0.33 | SSP-FL (FA) | 31 - 0.53 | 58 - 0.81 | 101 - 1.56 | 140 - 3.68 | 29 - 0.58 | 52 - 0.85 | 86 - 1.53 | 112 - 3.79
0.33 | ASP-FT (AP) | 31 - 0.62 | 58 - 0.88 | 101 - 1.36 | 140 - 2.54 | 29 - 0.72 | 52 - 0.98 | 86 - 1.44 | 112 - 2.69
D = 3
1.00 | SSP-RR (FA) | 84 - 1.74 | 111 - 2.81 | 188 - 4.51 | 279 - 7.06 | 75 - 1.89 | 107 - 3.29 | 161 - 5.07 | 232 - 8.08
1.00 | SSP-FL (FA) | 90 - 1.00 | 126 - 2.54 | 194 - 4.44 | 291 - 6.82 | 86 - 0.90 | 116 - 2.95 | 171 - 5.05 | 240 - 7.82
1.00 | ASP-FT (AP) | 90 - 0.75 | 142 - 1.34 | 207 - 2.37 | 298 - 3.71 | 86 - 0.76 | 132 - 1.50 | 185 - 2.69 | 254 - 4.18
0.50 | SSP-RR (FA) | 46 - 0.62 | 87 - 0.99 | 151 - 1.92 | 230 - 3.83 | 44 - 0.66 | 78 - 1.00 | 128 - 1.81 | 178 - 3.96
0.50 | SSP-FL (FA) | 46 - 0.61 | 86 - 0.96 | 152 - 1.77 | 237 - 3.41 | 44 - 0.65 | 78 - 0.95 | 129 - 1.64 | 183 - 3.40
0.50 | ASP-FT (AP) | 46 - 0.70 | 87 - 0.96 | 152 - 1.45 | 238 - 2.42 | 44 - 0.79 | 78 - 1.04 | 129 - 1.48 | 186 - 2.45
0.33 | SSP-RR (FA) | 31 - 0.59 | 58 - 0.91 | 103 - 1.65 | 167 - 3.15 | 29 - 0.64 | 52 - 0.94 | 87 - 1.59 | 129 - 3.06
0.33 | SSP-FL (FA) | 31 - 0.59 | 58 - 0.90 | 103 - 1.62 | 167 - 3.02 | 29 - 0.61 | 52 - 0.92 | 87 - 1.53 | 128 - 2.85
0.33 | ASP-FT (AP) | 31 - 0.66 | 58 - 0.93 | 103 - 1.43 | 167 - 2.33 | 29 - 0.74 | 52 - 1.00 | 87 - 1.46 | 129 - 2.35
D = 4
1.00 | SSP-RR (FA) | 75 - 1.72 | 156 - 2.17 | 191 - 4.01 | 356 - 5.84 | 66 - 1.98 | 133 - 2.40 | 167 - 4.40 | 301 - 6.03
1.00 | SSP-FL (FA) | 83 - 1.31 | 163 - 1.63 | 199 - 3.89 | 372 - 5.51 | 76 - 1.45 | 151 - 1.44 | 183 - 4.16 | 312 - 5.68
1.00 | ASP-FT (AP) | 90 - 0.74 | 163 - 1.12 | 246 - 1.99 | 372 - 3.31 | 86 - 0.78 | 151 - 1.12 | 217 - 2.10 | 312 - 3.45
0.50 | SSP-RR (FA) | 46 - 0.64 | 87 - 1.08 | 152 - 2.03 | 246 - 3.87 | 44 - 0.67 | 79 - 1.04 | 130 - 1.87 | 192 - 3.39
0.50 | SSP-FL (FA) | 46 - 0.62 | 87 - 1.05 | 152 - 1.94 | 245 - 3.80 | 44 - 0.66 | 79 - 1.00 | 129 - 1.77 | 192 - 3.20
0.50 | ASP-FT (AP) | 46 - 0.73 | 87 - 1.04 | 152 - 1.63 | 245 - 2.77 | 44 - 0.84 | 79 - 1.13 | 130 - 1.65 | 192 - 2.66
0.33 | SSP-RR (FA) | 31 - 0.61 | 58 - 1.02 | 103 - 1.86 | 170 - 3.53 | 29 - 0.65 | 53 - 1.01 | 87 - 1.75 | 130 - 3.16
0.33 | SSP-FL (FA) | 31 - 0.61 | 58 - 0.99 | 104 - 1.82 | 170 - 3.51 | 29 - 0.62 | 53 - 0.97 | 87 - 1.69 | 130 - 3.04
0.33 | ASP-FT (AP) | 31 - 0.69 | 58 - 1.01 | 104 - 1.59 | 170 - 2.69 | 29 - 0.77 | 53 - 1.05 | 87 - 1.60 | 130 - 2.61
by fixing a threshold K that is compared with
$\delta = |\lambda^{ext}_{i,j}[u] - \lambda^{apr}_{i,j}[u]|$, namely if δ < K, then
$\lambda^{ext}_{i,j}[u]$ is not sent. The
choice of K depends not only on the specific code considered
but also on the quantization parameters used to
represent $\lambda^{ext}_{i,j}[u]$ and on the acceptable performance loss
in terms of Bit-Error-Rate (BER). In the following
we consider N = 5114 for HSDPA and N = 6144 for
3GPP-LTE. In both cases the extrinsic information
is represented on eight bits whereas the intrinsic
information is represented on six bits with three fractional
bits. Both decoders perform eight iterations (I = 8) with
P = 64 using the Log-MAP algorithm [42] with a LUT-stored
correction term. In Figs. 6 and 7 we show the BER
performance for the HSDPA and 3GPP-LTE codes respectively,
obtained by applying the ABR technique described in
the previous paragraph with several values (see footnote 3) for K. In
particular, in Fig. 6 we show for the HSDPA code that when
K > 10 the performance worsens significantly. As an example,
with K = 10 there is a performance loss of less than
0.1 dB in the waterfall region and nearly ideal performance
in the floor region. On the other hand, with K = 16
the performance loss is about 0.2 dB in the waterfall
region and the code floor is shifted to higher SNR values
by about 0.2 dB as well. Similar results were observed for
the 3GPP-LTE code, so, for the sake of clarity, in Fig. 7
only results obtained with K = 4, 6, 8, 10 are shown. For
3Since we use three fractional bits for data representation the integer
values of K we considered correspond to 0.25, 0.75, 1 and so on.
Figure 11 BER performance of the WiMAX N = 1920 turbo decoder with SL, BL, PFP representation and ABR technique, K = 4, 6. [Plot: BER versus SNR [dB]; curves: SL, BL, BL nξ = 6, BL nξ = 5, BL nξ = 4, BL nξ = 4 K = 4, BL nξ = 4 K = 6.]
both cases we obtained the corresponding best and average
bandwidth reduction at different SNR values through
Monte Carlo simulations (see footnote 4). Experimental results show that
the throughput increase is significant when there is a high
load on the network (R = 1), using either FL or RR input
FIFO management. In particular, in Figs. 8, 9 and 10 we
show the average throughput increase for the HSDPA turbo
decoder for D = 2, 3, 4 respectively with different values
of K. As can be observed, when R = 1 there is an average
throughput increase, with respect to a decoder where ABR
is not applied, that ranges from about 5 to 20 Mb/s for the
HSDPA turbo decoder. Furthermore, we observed that in the
best case there is a throughput increase of at least 60 Mb/s.
On the other hand, when R < 1 the average throughput
improvement is at most 5 Mb/s. Similar results have been
obtained for the LTE turbo decoder.
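The threshold rule used for binary codes (an extrinsic value is not sent when δ < K) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the LLRs are treated as plain quantized integers, and the names `abr_should_send` and `abr_filter` as well as the sample values are hypothetical and do not come from the Turbo-NoC simulator.

```python
def abr_should_send(lam_ext, lam_apr, k):
    """Return True when delta = |lam_ext - lam_apr| >= K, i.e. when the
    extrinsic value changed enough to be worth sending over the network."""
    return abs(lam_ext - lam_apr) >= k


def abr_filter(ext, apr, k):
    """Return the (index, value) pairs of the extrinsic LLRs that still
    have to be sent; the remaining ones are suppressed by ABR."""
    return [(j, e) for j, (e, a) in enumerate(zip(ext, apr))
            if abr_should_send(e, a, k)]


# Hypothetical quantized LLRs for four trellis steps.
ext = [12, -30, 5, 47]
apr = [10, -18, 4, 20]
sent = abr_filter(ext, apr, k=8)
print(sent, f"suppressed: {len(ext) - len(sent)} of {len(ext)}")
```

Every suppressed value is one packet less injected into the network, which is where the latency reduction, and hence the throughput gain at R = 1, comes from.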
To complete the comparison, we show in Table 1 the
throughput/area results for the HSDPA and LTE cases,
where the results for the HSDPA case with the
ASP-FT routing algorithm and AP node architecture are
taken from [7]. For the sake of clarity we present only area
results concerning the interconnection network, namely all
the REs, and we refer the reader to the specific papers,
as ASIC and ASIP implementations show significant differences
in terms of complexity [9–11, 13–16, 20–23]. As
can be observed, the significant throughput increase obtained
with the ABR technique on the FA node architecture when
R = 1 is paid for with an area overhead with respect to the AP
node architecture. However, as pointed out in [7], the AP
node architecture requires a large external memory to store
the routing information. Moreover, the difference in terms
of area between FA and AP node architectures reduces when
R < 1. In particular, as shown in Table 1, when R = 0.33
with P = 8 and P = 16 the FA node architecture with
the SSP-FL routing algorithm requires less area than the
AP one.
4 The worst case corresponds to simulations where ABR is not applied.
4.3 WiMAX Case Study
Simulation results shown in this section have been obtained
with N = 1920, as in [38]. Each component of the extrinsic
information is represented on eight bits whereas the intrinsic
information is represented on six bits with two fractional
bits. The decoder performs eight iterations (I = 8) with
P = 64 using the Max-Log-MAP algorithm [42].
Since in binary turbo codes U = {0, 1}, the LLR of the
extrinsic information is a scalar value. On the other hand,
for double-binary turbo codes U = {00, 01, 10, 11}; as a
consequence, $\lambda^{ext}_{i,j}[u]$ is an array containing three elements.
In [38], a bit-level double-binary turbo decoder architecture
is proposed to reduce the amount of memory needed to store
the extrinsic information. The same idea is exploited in
this work to reduce the area overhead of the NoC. Basically,
a double-binary uncoded symbol u can be represented
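as a pair of binary random variables AB. Then, with a
slight abuse of notation, given a binary random variable X,
we denote X = 0 with $\bar{X}$ and X = 1 with $X$. Resorting
to the Max-Log-MAP approximation we can convert
Symbol-Level (SL) LLRs to Bit-Level (BL) LLRs as
$$\lambda^{ext}_{i,j}[A] \simeq \mu_A - \mu_{\bar{A}} \tag{14}$$
$$\lambda^{ext}_{i,j}[B] \simeq \mu_B - \mu_{\bar{B}} \tag{15}$$
where
$$\mu_A = \max\left\{\lambda^{ext}_{i,j}[A\bar{B}],\ \lambda^{ext}_{i,j}[AB]\right\} \tag{16}$$
$$\mu_{\bar{A}} = \max\left\{0,\ \lambda^{ext}_{i,j}[\bar{A}B]\right\} \tag{17}$$
$$\mu_B = \max\left\{\lambda^{ext}_{i,j}[\bar{A}B],\ \lambda^{ext}_{i,j}[AB]\right\} \tag{18}$$
$$\mu_{\bar{B}} = \max\left\{0,\ \lambda^{ext}_{i,j}[A\bar{B}]\right\}. \tag{19}$$
Similarly, we can convert BL LLRs to SL LLRs with the
following approximations.
1) $\lambda^{ext}_{i,j}[A] \ge 0$ and $\lambda^{ext}_{i,j}[B] \ge 0$
$$\lambda^{ext}_{i,j}[A\bar{B}] \simeq \mu_{AB} - \lambda^{ext}_{i,j}[B] \tag{20}$$
$$\lambda^{ext}_{i,j}[\bar{A}B] \simeq \mu_{AB} - \lambda^{ext}_{i,j}[A] \tag{21}$$
$$\lambda^{ext}_{i,j}[AB] \simeq \mu_{AB} \tag{22}$$
2) $\lambda^{ext}_{i,j}[A] \ge 0$ and $\lambda^{ext}_{i,j}[B] < 0$
$$\lambda^{ext}_{i,j}[A\bar{B}] \simeq \lambda^{ext}_{i,j}[A] \tag{23}$$
$$\lambda^{ext}_{i,j}[\bar{A}B] \simeq 0 \tag{24}$$
$$\lambda^{ext}_{i,j}[AB] \simeq \lambda^{ext}_{i,j}[A] + \lambda^{ext}_{i,j}[B] \tag{25}$$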
Table 2 Throughput [Mb/s] - area SL [mm2] - area BL [mm2], PFP achieved with the WiMAX N = 1920 interleaver, with generalized Kautz
topologies for P ∈ {8, 16, 32, 64}, R ∈ {0.33, 0.5, 1}, SSP-RR, SSP-FL and ASP-FT routing algorithms, no ABR. Each entry is throughput - area SL - area BL, PFP.
R | Routing (node) | P = 8 | P = 16 | P = 32 | P = 64
D = 2
1.00 | SSP-RR (FA) | 104 - 2.15 - 1.46 | 138 - 3.61 - 2.43 | 195 - 5.16 - 3.51 | 264 - 6.97 - 4.85
1.00 | SSP-FL (FA) | 105 - 2.17 - 1.47 | 144 - 3.40 - 2.30 | 208 - 4.57 - 3.13 | 285 - 6.11 - 4.29
1.00 | ASP-FT (AP) | 105 - 1.62 - 0.92 | 144 - 2.54 - 1.43 | 208 - 3.42 - 1.99 | 285 - 4.61 - 2.79
0.50 | SSP-RR (FA) | 86 - 0.47 - 0.38 | 127 - 1.48 - 1.07 | 176 - 3.20 - 2.26 | 231 - 5.33 - 3.80
0.50 | SSP-FL (FA) | 86 - 0.42 - 0.35 | 134 - 1.19 - 0.88 | 187 - 2.74 - 1.96 | 246 - 4.60 - 3.33
0.50 | ASP-FT (AP) | 86 - 0.42 - 0.35 | 134 - 1.00 - 0.69 | 187 - 2.15 - 1.37 | 246 - 3.56 - 2.29
0.33 | SSP-RR (FA) | 58 - 0.39 - 0.33 | 102 - 0.78 - 0.62 | 153 - 1.94 - 1.45 | 199 - 4.05 - 2.98
0.33 | SSP-FL (FA) | 58 - 0.36 - 0.31 | 103 - 0.69 - 0.57 | 161 - 1.55 - 1.20 | 209 - 3.41 - 2.56
0.33 | ASP-FT (AP) | 58 - 0.38 - 0.33 | 103 - 0.67 - 0.54 | 161 - 1.33 - 0.99 | 209 - 2.73 - 1.89
D = 3
1.00 | SSP-RR (FA) | 148 - 1.07 - 0.77 | 204 - 2.19 - 1.53 | 306 - 3.59 - 2.53 | 432 - 5.70 - 4.08
1.00 | SSP-FL (FA) | 165 - 0.69 - 0.53 | 218 - 2.10 - 1.48 | 328 - 3.39 - 2.41 | 448 - 5.39 - 3.89
1.00 | ASP-FT (AP) | 165 - 0.59 - 0.43 | 249 - 1.34 - 0.86 | 344 - 2.57 - 1.60 | 452 - 4.07 - 2.59
0.50 | SSP-RR (FA) | 87 - 0.47 - 0.38 | 152 - 0.90 - 0.71 | 243 - 1.80 - 1.38 | 333 - 3.87 - 2.91
0.50 | SSP-FL (FA) | 87 - 0.45 - 0.37 | 152 - 0.85 - 0.68 | 242 - 1.68 - 1.31 | 334 - 3.45 - 2.64
0.50 | ASP-FT (AP) | 87 - 0.47 - 0.39 | 152 - 0.80 - 0.63 | 244 - 1.43 - 1.07 | 338 - 2.75 - 1.96
0.33 | SSP-RR (FA) | 58 - 0.44 - 0.37 | 103 - 0.84 - 0.67 | 168 - 1.64 - 1.28 | 243 - 3.27 - 2.53
0.33 | SSP-FL (FA) | 58 - 0.42 - 0.35 | 103 - 0.80 - 0.64 | 167 - 1.57 - 1.23 | 243 - 3.10 - 2.41
0.33 | ASP-FT (AP) | 58 - 0.44 - 0.37 | 103 - 0.76 - 0.60 | 168 - 1.38 - 1.05 | 243 - 2.57 - 1.89
D = 4
1.00 | SSP-RR (FA) | 135 - 1.24 - 0.88 | 253 - 1.79 - 1.29 | 323 - 3.41 - 2.45 | 513 - 5.38 - 3.93
1.00 | SSP-FL (FA) | 153 - 0.94 - 0.69 | 279 - 1.41 - 1.05 | 344 - 3.21 - 2.33 | 533 - 5.09 - 3.76
1.00 | ASP-FT (AP) | 166 - 0.63 - 0.46 | 279 - 1.15 - 0.80 | 393 - 2.28 - 1.51 | 533 - 3.99 - 2.66
0.50 | SSP-RR (FA) | 87 - 0.50 - 0.41 | 155 - 0.92 - 0.73 | 247 - 1.98 - 1.53 | 354 - 3.83 - 2.94
0.50 | SSP-FL (FA) | 87 - 0.48 - 0.39 | 155 - 0.90 - 0.72 | 248 - 1.89 - 1.47 | 356 - 3.70 - 2.84
0.50 | ASP-FT (AP) | 87 - 0.52 - 0.43 | 155 - 0.87 - 0.69 | 249 - 1.66 - 1.24 | 356 - 3.12 - 2.27
0.33 | SSP-RR (FA) | 58 - 0.47 - 0.39 | 104 - 0.92 - 0.73 | 169 - 1.85 - 1.45 | 248 - 3.61 - 2.80
0.33 | SSP-FL (FA) | 58 - 0.44 - 0.36 | 104 - 0.88 - 0.70 | 169 - 1.75 - 1.37 | 248 - 3.45 - 2.67
0.33 | ASP-FT (AP) | 58 - 0.48 - 0.40 | 104 - 0.84 - 0.66 | 169 - 1.57 - 1.19 | 248 - 2.97 - 2.19
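3) $\lambda^{ext}_{i,j}[A] < 0$ and $\lambda^{ext}_{i,j}[B] \ge 0$
$$\lambda^{ext}_{i,j}[A\bar{B}] \simeq 0 \tag{26}$$
$$\lambda^{ext}_{i,j}[\bar{A}B] \simeq \lambda^{ext}_{i,j}[B] \tag{27}$$
$$\lambda^{ext}_{i,j}[AB] \simeq \lambda^{ext}_{i,j}[A] + \lambda^{ext}_{i,j}[B] \tag{28}$$
4) $\lambda^{ext}_{i,j}[A] < 0$ and $\lambda^{ext}_{i,j}[B] < 0$
$$\lambda^{ext}_{i,j}[A\bar{B}] \simeq \lambda^{ext}_{i,j}[A] \tag{29}$$
$$\lambda^{ext}_{i,j}[\bar{A}B] \simeq \lambda^{ext}_{i,j}[B] \tag{30}$$
$$\lambda^{ext}_{i,j}[AB] \simeq \lambda^{ext}_{i,j}[A] + \lambda^{ext}_{i,j}[B] - \mu_{AB} \tag{31}$$
where
$$\mu_{AB} = \max\left\{\lambda^{ext}_{i,j}[A],\ \lambda^{ext}_{i,j}[B]\right\} \tag{32}$$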
For further details on bit to symbol and symbol to bit
conversion the reader can refer to [38].
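The symbol-to-bit and bit-to-symbol conversions of Eqs. 14 to 32 can be sketched in Python as follows. The sketch rests on assumptions that should be checked against [38]: the symbol with A = 0 and B = 0 is taken as the reference with LLR equal to zero, the three SL elements are ordered as (A = 0, B = 1), (A = 1, B = 0), (A = 1, B = 1), and the function and variable names are hypothetical.

```python
def sl_to_bl(l01, l10, l11):
    """SL -> BL conversion (Eqs. 14-19).
    l01, l10, l11 are the SL LLRs of symbols (A=0,B=1), (A=1,B=0), (A=1,B=1);
    the symbol (A=0,B=0) is the assumed zero reference."""
    mu_a = max(l10, l11)      # best symbol with A = 1
    mu_not_a = max(0, l01)    # best symbol with A = 0
    mu_b = max(l01, l11)      # best symbol with B = 1
    mu_not_b = max(0, l10)    # best symbol with B = 0
    return mu_a - mu_not_a, mu_b - mu_not_b


def bl_to_sl(la, lb):
    """BL -> SL conversion (Eqs. 20-32); returns (l01, l10, l11)."""
    mu_ab = max(la, lb)
    if la >= 0 and lb >= 0:
        return mu_ab - la, mu_ab - lb, mu_ab
    if la >= 0 and lb < 0:
        return 0, la, la + lb
    if la < 0 and lb >= 0:
        return lb, 0, la + lb
    return lb, la, la + lb - mu_ab


for la, lb in [(3, 5), (4, -1), (-2, 6), (-3, -5)]:
    sl = bl_to_sl(la, lb)
    print((la, lb), "->", sl, "->", sl_to_bl(*sl))
```

For the four sample pairs the round trip returns the original BL values, but this does not hold in general (for instance when A and B have opposite signs and their sum is negative), which is consistent with the conversion being an approximation.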
The use of BL LLRs introduces a BER performance
loss of about 0.2 dB (see Fig. 11), but it reduces the data
width by one third with respect to SL LLRs, as the payload
of each packet contains $\lambda^{ext}_{i,j}[A]$ and $\lambda^{ext}_{i,j}[B]$ instead
of $\lambda^{ext}_{i,j}[u]$. To further reduce the data width we applied to
BL LLRs the Pseudo-Floating-Point (PFP) representation
suggested in [37]. As highlighted also in [66, 67], the most
significant bits of the extrinsic information play an important
role in the decoding procedure. Indeed, the basic idea is
to analyze the binary representation of $\lambda^{ext}_{i,j}[A]$ and $\lambda^{ext}_{i,j}[B]$
(as 2's complement values) from the most significant bit to
the least significant bit and to detect the first zero-one or
one-zero transition, which represents the starting bit of the
extrinsic information significand. We denote the significand
as ξ, and the number of bits that prefix ξ is coded as a
shift index σ. Thus, for each pair $\lambda^{ext}_{i,j}[A]$, $\lambda^{ext}_{i,j}[B]$ we
obtain $\xi_{i,j}[A]$, $\xi_{i,j}[B]$, $\sigma_{i,j}[A]$ and $\sigma_{i,j}[B]$. Then, according
to [37], we impose $\sigma_{i,j} = \min\{\sigma_{i,j}[A], \sigma_{i,j}[B]\}$. Denoting with
$n_\lambda$, $n_\xi$ and $n_\sigma$ the number of bits used to represent λ, ξ and σ
respectively, we obtain
$$\tilde{\xi}_{i,j}[A] = \lambda^{ext}_{i,j}[A] \gg (n_\lambda - n_\xi - \sigma_{i,j}) \tag{33}$$
$$\tilde{\xi}_{i,j}[B] = \lambda^{ext}_{i,j}[B] \gg (n_\lambda - n_\xi - \sigma_{i,j}) \tag{34}$$
where $\gg$ stands for arithmetic right shift. As a consequence,
the payload of each packet sent on the network now
contains $\tilde{\xi}_{i,j}[A]$, $\tilde{\xi}_{i,j}[B]$ and $\sigma_{i,j}$ instead of $\lambda^{ext}_{i,j}[u]$.
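A minimal Python sketch of the PFP encoding of Eqs. 33 and 34 is given below. Several details are assumptions made for illustration and are not stated in the text: σ is computed as the number of redundant sign-extension bits below the most significant bit, it is saturated to nλ − nξ so that the shift amount never becomes negative, and the decoding step (a left shift by the same amount) is a hypothetical receiver-side counterpart. All function names are hypothetical.

```python
N_LAMBDA = 8   # bits of each BL extrinsic value (from the text)
N_XI = 4       # bits kept for each significand (from the text)


def redundant_sign_bits(value, n_bits=N_LAMBDA):
    """Count the bits below the MSB that repeat the sign bit, scanning the
    2's complement word from MSB to LSB until the first 0/1 transition."""
    word = value & ((1 << n_bits) - 1)
    sign = (word >> (n_bits - 1)) & 1
    count = 0
    for pos in range(n_bits - 2, -1, -1):
        if (word >> pos) & 1 != sign:
            break
        count += 1
    return count


def pfp_encode(lam_a, lam_b):
    """Return (xi_a, xi_b, sigma) following Eqs. 33-34 with a shared sigma."""
    max_shift = N_LAMBDA - N_XI
    sigma = min(redundant_sign_bits(lam_a), redundant_sign_bits(lam_b), max_shift)
    shift = max_shift - sigma
    # Python's >> on int is an arithmetic (sign-preserving) right shift.
    return lam_a >> shift, lam_b >> shift, sigma


def pfp_decode(xi_a, xi_b, sigma):
    """Assumed inverse: restore the magnitude by shifting left again."""
    shift = N_LAMBDA - N_XI - sigma
    return xi_a << shift, xi_b << shift


print(pfp_encode(11, -3), pfp_decode(*pfp_encode(11, -3)))
print(pfp_encode(100, 2), pfp_decode(*pfp_encode(100, 2)))
```

Sharing a single σ between the two values means that the smaller of the two loses more precision, which is the price paid for sending only one shift index per packet.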
Figure 12 Average throughput improvement at different SNR values of the WiMAX N = 1920 turbo decoder with K = 4, 6 on generalized Kautz networks D = 2 and P = 64. [Plot: T [Mb/s] versus SNR [dB]; curves: FL and RR, each with no ABR and K = 4, 6.]
Figure 13 Average throughput improvement at different SNR values of the WiMAX N = 1920 turbo decoder with K = 4, 6 on generalized Kautz networks D = 3 and P = 64. [Plot: T [Mb/s] versus SNR [dB]; curves: FL and RR, each with no ABR and K = 4, 6.]
As stated in the first paragraph of this section, $n_\lambda = 8$.
Thus, denoting with $n_d$ the number of bits devoted to representing the
extrinsic information in the payload, we have: i) $n_d = 3n_\lambda = 24$
for $\lambda^{ext}_{i,j}[u]$ and ii) $n_d = 2n_\lambda = 16$ for $\lambda^{ext}_{i,j}[A]$ and
$\lambda^{ext}_{i,j}[B]$. If we impose $n_\xi = 4$, we obtain $\sigma_{i,j} \le 4$ and so
$n_\sigma = 3$, leading to $n_d = 2n_\xi + n_\sigma = 11$, which is less than half
Figure 14 Average throughput improvement at different SNR values of the WiMAX N = 1920 turbo decoder with K = 4, 6 on generalized Kautz networks D = 4 and P = 64. [Plot: T [Mb/s] versus SNR [dB]; curves: FL and RR, each with no ABR and K = 4, 6.]
the value of $n_d$ for $\lambda^{ext}_{i,j}[u]$. As shown in Fig. 11, the BER
performance loss of the BL, PFP LLR representation is nearly
the same as that of the fixed point BL one. In Table 2 the throughput
and area results obtained by using the SL and the BL, PFP LLR
representations are shown for generalized Kautz topologies.
As can be observed, the area decrease as a function of
$n_d$ is not linear; however, it becomes particularly interesting
when R = 1. As an example, with R = 1, D = 4 and
P = 64 there is an area saving of up to 40 %.
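The payload widths quoted above (24, 16 and 11 bits) follow from simple arithmetic, reproduced here as a small Python sketch. The helper name `payload_bits` is hypothetical, and the generalization to other values of nξ assumes that σ ranges from 0 to nλ − nξ, which the text states only for nξ = 4.

```python
def payload_bits(n_lambda=8, n_xi=4):
    """Bits devoted to the extrinsic information in a packet payload for
    the SL, BL and BL with PFP representations."""
    sigma_max = n_lambda - n_xi           # assumed range of sigma: 0..sigma_max
    n_sigma = sigma_max.bit_length()      # bits needed to code 0..sigma_max
    return {
        "SL": 3 * n_lambda,               # three symbol-level LLRs
        "BL": 2 * n_lambda,               # two bit-level LLRs
        "BL_PFP": 2 * n_xi + n_sigma,     # two significands plus one shift index
    }


for n_xi in (6, 5, 4):
    print(n_xi, payload_bits(n_xi=n_xi))
```

For nξ = 4 the PFP payload is 11 bits, i.e. less than half of the 24 bits required by the SL representation.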
The techniques described in the previous paragraphs are
all aimed at reducing the area of the NoC-based turbo
decoder. Furthermore, the ABR technique described in
Section 4.1 can be used to improve the throughput as well.
In order to limit the BER performance loss introduced by
the ABR technique, we employ the SL reliability criterion
proposed in [36], but we send BL, PFP extrinsic information
when the criterion is not met. The ABR technique we
used is reported in Algorithm 1 and can be summarized
as follows: denoting with $\vartheta^{apr}_{i,j}$, $\varphi^{apr}_{i,j}$ and $\vartheta^{ext}_{i,j}$, $\varphi^{ext}_{i,j}$ the first and the
second maximum values in $\lambda^{apr}_{i,j}[u]$ and $\lambda^{ext}_{i,j}[u]$ respectively,
we compute $\Delta^{apr}_{i,j} = |\vartheta^{apr}_{i,j} - \varphi^{apr}_{i,j}|$ and $\Delta^{ext}_{i,j} = |\vartheta^{ext}_{i,j} - \varphi^{ext}_{i,j}|$;
finally, we compare $\delta_{i,j} = |\Delta^{ext}_{i,j} - \Delta^{apr}_{i,j}|$ with the
threshold K.
Algorithm 1 SL reliability criterion proposed in [36]
1: $\vartheta^{apr}_{i,j} \leftarrow \max\{\lambda^{apr}_{i,j}[u]\}$
2: $\varphi^{apr}_{i,j} \leftarrow \max\{\lambda^{apr}_{i,j}[u] \setminus \vartheta^{apr}_{i,j}\}$
3: $\vartheta^{ext}_{i,j} \leftarrow \max\{\lambda^{ext}_{i,j}[u]\}$
4: $\varphi^{ext}_{i,j} \leftarrow \max\{\lambda^{ext}_{i,j}[u] \setminus \vartheta^{ext}_{i,j}\}$
5: $\Delta^{apr}_{i,j} \leftarrow |\vartheta^{apr}_{i,j} - \varphi^{apr}_{i,j}|$
6: $\Delta^{ext}_{i,j} \leftarrow |\vartheta^{ext}_{i,j} - \varphi^{ext}_{i,j}|$
7: $\delta_{i,j} \leftarrow |\Delta^{ext}_{i,j} - \Delta^{apr}_{i,j}|$
8: if $\delta_{i,j} < K$ then
9: do not send any packet
10: else
11: send $\tilde{\xi}_{i,j}[A]$, $\tilde{\xi}_{i,j}[B]$, $\sigma_{i,j}$
12: endif
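The decision taken by Algorithm 1 can be sketched in Python as follows. This is an illustrative model rather than the authors' implementation: the SL LLRs are passed as plain lists of integers, ties on the maximum are resolved by removing a single occurrence (an assumption, since the set-difference notation does not cover duplicates), and the function names are hypothetical.

```python
def reliability_gap(llrs):
    """Distance between the first and the second maximum of the SL LLRs
    (steps 1-2 and 5, or 3-4 and 6, of Algorithm 1)."""
    first, second = sorted(llrs, reverse=True)[:2]
    return abs(first - second)


def sl_abr_should_send(lam_apr, lam_ext, k):
    """Steps 7-12 of Algorithm 1: return True when a packet must be sent,
    i.e. when the change of the reliability gap is at least K."""
    delta = abs(reliability_gap(lam_ext) - reliability_gap(lam_apr))
    return delta >= k


# Hypothetical SL LLRs (one value per symbol) before and after a half iteration.
lam_apr = [0, 10, 2, 1]
print(sl_abr_should_send(lam_apr, [0, 12, 3, 1], k=4))   # gap 8 -> 9
print(sl_abr_should_send(lam_apr, [0, 20, 3, 1], k=4))   # gap 8 -> 17
```

In the first case the gap between the two most likely symbols barely changes, so the packet can be dropped; in the second case the gap grows by 9, so the BL, PFP payload is sent.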
As shown in Fig. 11, the BER performance loss introduced
by the ABR technique is negligible. Moreover, as
shown in Figs. 12, 13 and 14, when R = 1 the ABR technique
induces an average throughput increase of about 5 to
20 Mb/s. Similarly to the binary codes, in the best case the
throughput improvement is more than 40 Mb/s,
whereas when R < 1 the average throughput improvement
is at most 5 Mb/s.
5 Conclusions
In this work ABR techniques have been exploited to
improve the throughput of NoC-based turbo decoder architectures.
When the load of the network is high, the average
throughput is improved by about 5 to 20 Mb/s, and in the
best case the throughput is increased by more than 60 Mb/s
and 40 Mb/s for binary and double-binary codes respectively.
Moreover, the area required to support double-binary
codes has been significantly reduced (by up to more than
40 %) by applying the BL, PFP representation of the extrinsic
information, with a BER performance loss of about
0.2 dB.
References
1. IEEE Std 802.16, part 16: air interface for fixed broadband
wireless access systems, Oct. 2004.
2. TS 36.212 v8.0.0: Multiplexing and Channel Coding (FDD)
(Release 8), 2007-09.
3. Berrou, C., Glavieux, A., Thitimajshima, P. (1993). Near Shannon
limit error correcting coding and decoding: turbo codes. In IEEE
international conference on communications (pp. 1064–1070).
4. Gallager, R.G. (1962). Low density parity check codes. IRE
Transactions on Information Theory, IT-8(1), 21–28.
5. Boutillon, E., Douillard, C., Montorsi, G. (2007). Iterative decod-
ing of concatenated convolutional codes: implementation issues.
Proceedings of the IEEE, 95(6), 1201–1227.
6. Guilloud, F., Boutillon, E., Tousch, J., Danger, J.L. (2007).
Generic description and synthesis of LDPC decoders. IEEE Trans-
actions on Communications, 55(11), 2084–2091.
7. Martina, M., Masera, G., Moussa, H., Baghdadi, A. (2011). On
chip interconnects for multiprocessor turbo decoding architec-
tures. Elsevier Microprocessors and Microsystems, 35(2), 167–
181.
8. Polydoros, A. (2008). Algorithmic aspects of radio flexibility. In
IEEE international symposium on personal, indoor and mobile
communications (pp. 1–5).
9. Vogt, T., & Wehn, N. (2008). Reconfigurable ASIP for con-
volutional and turbo decoding in an SDR environment. IEEE
Transactions on VLSI, 16(10), 1309–1320.
10. Bougard, B., Priewasser, R., der Perre, L.V., Huemer, M.
(2008). Algorithm-architecture co-design of a multi-standard FEC
decoder ASIP. In ICT mobile summit conference.
11. Martina, M., Nicola, M., Masera, G. (2008). A flexible UMTS-
WiMax turbo decoder architecture. IEEE Transactions on Circuits
and Systems II, 55(4), 369–373.
12. Rovini, M., Gentile, G., Fanucci, L. (2009). A flexible state-
metric recursion unit for a multi-standard BCJR decoder. In IEEE
international conference on signals circuits and systems (pp. 1–6).
13. Kim, J.H., & Park, I.C. (2009). A unified parallel radix-4 turbo
decoder for mobile WiMAX and 3GPP-LTE. In IEEE custom
integrated circuits conference (pp. 487–490).
14. Muller, O., Baghdadi, A., Jezequel, M. (2009). From parallelism
levels to a multi-ASIP architecture for turbo decoding. IEEE
Transactions on VLSI, 17(1), 92–102.
15. Reddy, P., Alkhayat, R., Clermidy, F., Baghdadi, A., Jezequel,
M. (2010). Power consumption analysis and energy efficient
optimization for turbo decoder implementation. In International
symposium on system-on-chip (pp. 12–17).
16. Ilnseher, T., May, M., Wehn, N. (2010). A multi-mode 3GPP-
LTE/HSDPA turbo decoder. In IEEE international conference on
communication systems (pp. 336–340).
17. Gentile, G., Rovini, M., Fanucci, L. (2010). A multi-standard
flexible turbo/LDPC decoder via ASIC design. In International
symposium on turbo codes & iterative information processing
(pp. 294–298).
18. Sun, Y., & Cavallaro, J.R. (2010). A flexible LDPC/turbo decoder
architecture. Journal of Signal Processing Systems, 64(1), 1–16.
19. Murugappa, P., Al-Khayat, R., Baghdadi, A., Jezequel, M. (2011).
A flexible high throughput multi-ASIP architecture for LDPC
and turbo decoding. In Design, automation and test in Europe
conference and exhibition (pp. 1–6).
20. Shin, M.C., & Park, I.C. (2007). SIMD processor-based turbo
decoder supporting multiple third-generation wireless standards.
IEEE Transactions on VLSI, 15(7), 801–810.
21. Sun, Y., Zhu, Y., Goel, M., Cavallaro, J.R. (2008). Config-
urable and scalable high throughput turbo decoder architecture
for multiple 4G wireless standards. In IEEE international confer-
ence on application-specific systems, architectures and processors
(pp. 209–214).
22. Lin, C.H., Chen, C.Y., Wu, A.Y. (2011). Area efficient scalable
MAP processor design for high-throughput multistandard convolutional
turbo decoding. IEEE Transactions on VLSI, 19(2), 305–
318.
23. Al-Khayat, R., Murugappa, P., Baghdadi, A., Jezequel, M. (2011).
Area and throughput optimized ASIP for multi-standard turbo
decoding. In IEEE international symposium on rapid system
prototyping (pp. 79–84).
24. Hoffmann, A., Schliebusch, O., Nohl, A., Braun, G., Meyr, H.
(2001). A methodology for the design of application specific
instruction set processors (ASIP) using the machine description
language LISA. In International conference on computer-aided
design.
25. Neeb, C., Thul, M.J., Wehn, N. (2005). Network-on-chip-centric
approach to interleaving in high throughput channel decoders. In
IEEE international symposium on circuits and systems (pp. 1766–
1769).
26. Moussa, H., Muller, O., Baghdadi, A., Jezequel, M. (2007).
Butterfly and Benes-based on-chip communication networks for
multiprocessor turbo decoding. In Design, automation and test in
Europe conference and exhibition (pp. 654–659).
27. Moussa, H., Baghdadi, A., Jezequel, M. (2008). Binary de Bruijn
interconnection network for a flexible LDPC/turbo decoder. In
IEEE international symposium on circuits and systems (pp. 97–
100).
28. Martina, M., & Masera, G. (2010). Turbo NOC: a framework
for the design of network on chip based turbo decoder architec-
tures. IEEE Transactions on Circuits and Systems I, 57(10), 2776–
2789.
29. Wang, G., Sun, Y., Cavallaro, J.R., Guo, Y. (2011). High-
throughput contention-free concurrent interleaver architecture for
multi-standard turbo decoder. In IEEE international conference
on application-specific systems, architectures and processors
(pp. 113–121).
30. Vacca, F., Moussa, H., Baghdadi, A., Masera, G. (2009). Flexi-
ble architectures for LDPC decoders based on network on chip
paradigm. In Euromicro conference on digital system design
(pp. 582–589).
31. Guerrier, P., & Greiner, A. (2000). A generic architecture for on-
chip packet-switched interconnections. In Design, automation and
test in Europe conference and exhibition (pp. 250–256).
32. Dally, W.J., & Towles, B. (2001). Route packets, not wires: On-
chip interconnection networks. In Design automation conference
(pp. 684–689).
33. Benini, L., & Micheli, G.D. (2002). Networks on chips: a new soc
paradigm. IEEE Computer, 35(1), 70–78.
34. Kumar, S., Jantsch, A., Soininen, J.P., Forsell, M., Millberg, M.,
Oberg, J., Tiensyrja, K., Hemani, A. (2002). A network on chip
architecture and design methodology. In IEEE computer society
annual symposium on VLSI (pp. 105–112).
35. Awais, M., & Condo, C. (2012). Flexible LDPC decoder architec-
tures. VLSI Design, 2012, 1–16.
36. Muller, O., Baghdadi, A., Jezequel, M. (2006). Bandwidth reduc-
tion of extrinsic information exchange in turbo decoding. IET
Electronics Letters, 42(19), 1104–1105.
37. Park, S.M., Kwak, J., Lee, K. (2008). Extrinsic information
memory reduced architecture for non-binary turbo decoder imple-
mentation. In IEEE vehicular technology conference (pp. 539–
543).
38. Kim, J.H., & Park, I.C. (2009). Bit-level extrinsic information
exchange method for double-binary turbo codes. IEEE Transac-
tions on Circuits and Systems II, 56(1), 81–85.
39. Berrou, C., Jezequel, M., Douillard, C., Kerouedan, S. (2001). The
advantages of non-binary turbo codes. In IEEE information theory
workshop (pp. 61–63).
40. Bahl, L.R., Cocke, J., Jelinek, F., Raviv, J. (1974). Optimal
decoding of linear codes for minimizing symbol error rate. IEEE
Transactions on Information Theory, 20(3), 284–287.
41. Robertson, P., Villebrun, E., Hoeher, P. (1995). A comparison of
optimal and sub-optimal MAP decoding algorithms operating in
the Log domain. In IEEE ICC (pp. 1009–1013).
42. Robertson, P., Hoeher, P., Villebrun, E. (1997). Optimal and
sub-optimal maximum a posteriori algorithms suitable for turbo
decoding. European Transactions on Telecommunications, 8(2),
119–125.
43. Papaharalabos, S., Mathiopoulos, P.T., Masera, G., Martina, M.
(2009). On optimal and near-optimal turbo decoding using gener-
alized max∗ operator. IEEE Communications Letters, 13(7), 522–
524.
44. Flich, J., & Duato, J. (2008). Logic-based distributed routing for
NoCs. IEEE Computer Architecture Letters, 7(1), 13–16.
45. Shi, Z., Yang, Y., Zeng, X., Yu, Z. (2011). A reconfigurable
and deadlock-free routing algorithm for 2D Mesh Network-on-
Chip. In IEEE international symposium of circuits and systems
(pp. 2934–2937).
46. Moraveji, R., Sarbazi-Azad, H., Zomaya, A.Y. (2010). A gen-
eral methodology for direction-based irregular routing algorithms.
Journal of Parallel and Distributed Computing, 70(4), 363–370.
47. http://www.3gpp2.org. 2004.
48. Wang, Z., & Li, Q. (2007). Very low-complexity hardware inter-
leaver for turbo decoding. IEEE Transactions on Circuits and
Systems II, 54(7), 636–640.
49. Martina, M., Nicola, M., Masera, G. (2008). Hardware design of
a low complexity, parallel interleaver for wimax duo-binary turbo
decoding. IEEE Communications Letters, 12(11), 846–848.
50. Martina, M. (2012). Turbo NOC: Network On Chip based
turbo decoder architectures, downloadable at http://personal.
delen.polito.it/maurizio.martina/turbo.html.
51. Bang-Jensen, J., & Gutin, G. (2008). Digraphs, theory, algorithms
and applications (2nd ed.) London: Springer-Verlag.
52. Imase, M., & Itoh, M. (1981). Design to minimize diameter on
building-block network. IEEE Transactions on Computers, 30(6),
439–442.
53. Imase, M., & Itoh, M. (1983). A design for directed graphs with
minimum diameter. IEEE Transactions on Computers, 32(8), 782–
784.
54. Sabbaghi-Nadooshan, R., & Sarbazi-Azad, H. (2008). The kautz
mesh: a new topology for socs. In IEEE international SoC design
conference (pp. 300–303).
55. Sabbaghi-Nadooshan, R. (2011). Kautz mesh topology for on-chip
networks. Journal of Computing, 3(2), 33–40.
56. Pulimeno, A., Graziano, M., Piccinini, G. (2012). UDSM trends,
comparison: From technology roadmap to UltraSparc Niagara2.
IEEE Transactions on VLSI Systems, 20(7), 1341–1346.
57. Angiolini, F., Meloni, P., Carta, S.M., Raffo, L., Benini, L. (2007).
A layout-aware analysis of networks-on-chip and traditional
interconnects for MPSoCs. IEEE Transactions on Computer-aided
Design of Integrated Circuits and Systems, 26(3), 421–434.
58. Meloni, P., Loi, I., Angiolini, F., Carta, S., Barbaro, M., Raffo, L.,
Benini, L. (2007). Area and power modeling for networks-on-chip
with layout awareness. VLSI Design. special issue on Networks on
Chip, Article ID 50285, 12 pages.
59. Hosseinabady, M., Kakoee, M.R., Mathew, J., Pradhan, D.K.
(2007). Reliable network-on-chip based on generalized de Bruijn
graph. In IEEE international high level design validation and test
workshop (pp. 3–10).
60. Hosseinabady, M., Kakoee, M.R., Mathew, J., Pradhan, D.K.
(2008). De Bruijn graph as a low latency scalable architecture for
energy efficient massive NoCs. In Design, automation and test in
Europe conference and exhibition (pp. 1370–1373).
61. Samatham, M.R., & Pradhan, D.K. (1989). The De Bruijn mul-
tiprocessor network: a versatile parallel processing and sorting
network for VLSI. IEEE Transactions on Computers, 38(4), 567–
581.
62. Cheng, C.C., Tsai, Y.M., Chen, L.G., Chandrakasan, A.P. (2010).
A 0.077 to 0.168 nJ/bit/iteration scalable 3GPP LTE turbo decoder
with an adaptive sub-block parallel scheme and an embedded
DVFS engine. In IEEE custom integrated circuits conference
(pp. 1–4).
63. Gentile, G., Rovini, M., Fanucci, L. (2010). Low-power tech-
niques for flexible channel decoders. In International conference
on applied electronics (pp. 1–4).
64. Reddy, P., Clermidy, F., Baghdadi, A., Jezequel, M. (2011). A
low complexity stopping criterion for reducing power consump-
tion in turbo decoders. In Design, automation and test in Europe
conference and exhibition (pp. 1–6).
65. Kim, D.H., & Kim, S.W. (2006). Bit-level stopping of turbo
decoding. IEEE Communications Letters, 10(3), 183–185.
66. Vogt, J., Ertel, J., Finger, A. (2000). Reducing bit width of extrin-
sic memory in turbo decoder realisations. IEE Electronics Letters,
36(20), 1714–1716.
67. Singh, A., Boutillon, E., Masera, G. (2008). Bit-width optimiza-
tion of extrinsic information in turbo decoder. In International
symposium on turbo codes & related topics (pp. 134–138).
Maurizio Martina was born in Pinerolo, Italy, in 1975. He received
the M.Sc. and Ph.D. in electrical engineering from Politecnico di
Torino, Italy, in 2000 and 2004 respectively. He is currently assis-
tant professor at the VLSI Lab, Politecnico di Torino. His research
activities include VLSI design and implementation of architectures for
digital signal processing and communications.
Guido Masera received the Dr.Eng. degree (summa cum laude) in
1986, and the Ph.D. degree in electrical engineering from Politecnico
di Torino, Italy, in 1992. From 1986 to 1988 he was with CSELT
(Centro Studi e Laboratori in Telecomunicazioni, Torino, Italy) as a
researcher involved in the standardization activities for the GSM sys-
tem. Since 1992 he has been Assistant Professor and then Associate
Professor at the Electronic Department, where he is a member of the
VLSI-Lab group. His research interests include several aspects in the
design of digital integrated circuits and systems, with special emphasis
on high-performance architecture development (especially for wireless
communications and multimedia applications) and on-chip intercon-
nect modeling and optimization. He has coauthored more than 160
journal and conference papers in the areas of ASIC-SoC development,
architectural synthesis, VLSI circuit modeling and optimization. In the
frame of National and European research projects, he has been co-
designer of several ASIC and FPGA implementations in the fields of
Artificial Intelligence, Computer Networks, Digital Signal Processing,
Transmission and Coding.
