Advanced Connection Allocation Techniques in Circuit Switching Network on Chip by Chen, Yong
Technische Universität Dresden
Advanced Connection Allocation Techniques
in Circuit Switching
Network on Chip
Yong Chen
Born in Anhui, China, 10 June 1989
Dissertation approved by the Faculty of Electrical and Computer Engineering
of Technische Universität Dresden
for the award of the academic degree
Doktoringenieur
(Dr.-Ing.)
Chair: Prof. Dr.-Ing. Christian G. Mayr
Reviewers: Prof. Dr.-Ing. Gerhard P. Fettweis
Prof. Dr.-Ing. Holger Blume
Prof. Dr.-Ing. Ralf Lehnert
Date of submission: 29.06.2017
Date of defense: 08.09.2017
Abstract
With the advancement of semiconductor technology, Systems on Chip (SoCs) are becoming
more and more complex, so on-chip communication has become a bottleneck of SoC
design. Since the traditional bus system is inefficient and does not scale, the
Network-on-Chip (NoC) has emerged as a promising communication mechanism for complex SoCs.
As some systems have specific performance requirements, such as a minimum throughput
(for real-time streaming data) or bounded latency (for interrupts, process synchronization,
etc.), communication with Guaranteed Service (GS) support becomes crucial for
predictable SoC architectures. Circuit Switching (CS) is a popular approach to support
GS: it first allocates an exclusive connection (circuit) between the source
and destination nodes, and then the data packets are delivered over this connection.
However, plain CS is inefficient and inflexible because the resources are occupied by a
single connection during its whole lifetime, which can block other communications. Hence,
two extensions of CS have been proposed to share resources: i) Time-Division Multiplexing
(TDM), in which the available link capacity is split into multiple time slots that are
shared by different flows; and ii) Space-Division-Multiplexing (SDM), in which only a subset
(sub-channel) of the link wires is exclusively allocated to a specific connection, while the
remaining wires of the link can be used by other flows.
Connection allocation is critical for CS, since data delivery can start only after the
associated connection has been allocated. In this thesis, we propose a dedicated hardware
connection allocator to solve the dynamic connection allocation problem for CS NoCs, which
has to i) allocate a contention-free path between each source-destination pair and ii) allocate
appropriate portions of link bandwidth (an appropriate number of time slots and sub-channels)
along the path. The dedicated connection allocator, called NoCManager, solves the
connection allocation problem by employing a trellis-search based shortest path algorithm.
The trellis search can explore all possible paths between the source node and the destination.
Moreover, it finds the requested path with a fixed, low latency and guarantees
path optimality in terms of path length if a path is available.
In this thesis, two different trellis graphs are proposed: the Forward-Backtrack trellis and
the Register-Exchange trellis. The Forward-Backtrack trellis completes the path search in two
steps: forward search and backtracking. First, the forward search begins at the source node
and traverses the network to find a free path. When the destination node is reached,
backtracking starts from the destination to select the survivor path and collect the
associated path parameters. In contrast, the Register-Exchange trellis saves the entire
survivor path sequences during the forward search. Consequently, the backtracking step can be
omitted, and thus the allocation time is halved compared to the forward-backtrack approach.
Moreover, each trellis graph comes in three variants: the unfolded, the folded and the
bidirectional structure. The unfolded structure provides high allocation speed, while the
folded structure is more efficient from a hardware point of view. The bidirectional structure
starts the search from both sides, the source node and the destination node, simultaneously,
so the allocation is twice as fast as with the unidirectional search. Furthermore, in order to
address the scalability issue of centralized systems, a partitioned architecture
(i.e. a spatial partitioning technique) is proposed that divides a large system into multiple
smaller, distinct logical partitions, each served by a local NoCManager. This partitioning
technique keeps the request load of each manager and the manager-node communication
overhead moderate. Inside each partition, the path search problem is solved by the local
manager with the trellis-search algorithm. To establish a path that crosses partitions, the
managers communicate with each other in a distributed manner to converge on the global path.
In order to further enhance path diversity and resource utilization, we adopt a
combined TDM and SDM technique. In the combined TDM-SDM approach, each SDM sub-channel
is split into multiple time slots so that it can be shared by multiple flows. Hence,
the number of sub-channels can be kept moderate to reduce router complexity, while still
providing higher path diversity than the pure TDM scheme. In order to investigate and
optimize the TDM-SDM partitioning strategy, we studied the influence of different TDM-SDM
link partitioning strategies on success rate and path length, which allowed us to find the
optimal solution. The dedicated connection allocator using the trellis-search algorithm is
employed for TDM, SDM and TDM-SDM CS.
Finally, we present a router architecture that combines a circuit-switching network
(for GS communication) and a packet-switching network (for best-effort communication).
Acknowledgment
This thesis comprises the results generated during my work at the Vodafone Chair
Mobile Communications Systems at Technische Universität Dresden between 2013 and
2017.
First of all, I would like to thank my supervisor Prof. Gerhard Fettweis for giving me the
opportunity to join his group and for inspiring me with his invaluable guidance. I am equally
grateful to have had Prof. Blume as my co-supervisor, whose guidance and mentoring
made this journey possible. I also thank my group leader Dr. Emil Matus. Emil is a
very good person. He taught me how to do research and helped me through the difficult
times. Additionally, I am grateful to my colleagues Sadia Moriam, Seungseok Nam and
Mohammed Radi for reviewing the thesis.
I would like to thank all colleagues from the chair. I would like to thank Friedrich,
Sebastian, Wen, Stefan, Robert, Mattis, Amanda, Song, Zou and Zhang for all the joy and
fun they shared with me.
Finally, I am grateful to my parents and my wife Aihong for their faith in me. Your
support, encouragement and unwavering love enabled me to finish the thesis.
Yong Chen
Dresden, Germany, June 2017
Contents
Abstract
List of Symbols
List of Abbreviations
List of Figures
List of Tables
1 Introduction
1.1 Overview of on-chip interconnection solutions
1.2 Network on Chip
1.3 Guaranteed Service in NoCs
1.3.1 Circuit Switching NoCs
1.4 Scope and Outline of this Work
2 Connection Allocation in CS NoCs
2.1 Connection allocation problem
2.2 Related work
2.2.1 Distributed allocation techniques
2.2.2 Centralized allocation techniques
2.3 Trellis Search based Allocation approach
3 Centralized Connection Allocation for TDM CS NoCs
3.1 Introduction of TDM CS
3.1.1 Connection allocation in TDM CS
3.2 System Model
3.3 Connection Allocator Architecture
3.3.1 Formalizing The Trellis Graph Structure
3.3.1.1 General Model
3.3.1.2 Path Search Model Simplification
3.3.2 Forward-Backtrack Trellis Path Search
3.3.2.1 Unfolded Trellis Search
3.3.2.2 Folded Trellis Search
3.3.2.3 Bidirectional Trellis Search
3.3.3 Forward-Backtrack Trellis Path Search Implementation
3.3.3.1 Unfolded Trellis Implementation
3.3.3.2 Bidirectional Trellis Implementation
3.3.3.3 Folded Trellis Implementation
3.4 Performance Evaluation of Forward-Backtrack trellis
3.4.1 Synthesis Results
3.4.2 Simulation Results
3.4.2.1 Comparison with centralized exhaustive path-search
3.4.2.2 Comparison with distributed parallel probe search
3.4.2.3 Influence of splitting the link into different time slots on success rate
3.4.2.4 Influence of allowing different hops of detours on success rate
3.5 Register-Exchange Trellis Search
3.5.1 Register-Exchange Trellis Path Search Algorithm
3.5.1.1 Unfolded Register-Exchange Trellis Path Search
3.5.1.2 Folded Register-Exchange Trellis Path Search
3.5.2 Trellis Path Search Implementation
3.5.3 Performance Evaluation
3.5.3.1 Synthesis Results
3.5.3.2 Simulation Results
3.6 Single Layer Trellis
3.6.1 Single-layer Trellis Path Search Algorithm
3.6.2 Single-layer Trellis Path Search Implementation
3.6.3 Synthesis Results
3.7 Partitioned Trellis Architecture
3.7.1 Partitioned TESSA search algorithm
3.7.2 NoCM architecture
3.7.2.1 Control signals
3.7.2.2 Control trellis path search in each partition
3.7.2.3 Ensuring that the destination is activated only once by one request
3.7.3 Performance Evaluation
3.7.3.1 Synthesis Results
3.7.3.2 Simulation Results
3.7.3.3 Suggestion on how to partition the system
3.8 Summary
4 Centralized Connection Allocation for Combined TDM-SDM CS NoCs
4.1 Introduction of SDM CS
4.1.1 Combined TDM and SDM CS
4.1.2 Connection allocation in combined TDM and SDM CS
4.2 System Model
4.3 Connection Allocator Architecture
4.3.1 Formalizing The Trellis Graph Structure
4.3.2 Trellis Path Search Algorithm
4.3.2.1 Unidirectional Trellis Path Search
4.3.2.2 Bidirectional Trellis Path Search
4.3.3 Trellis Path Search Implementation
4.4 Performance Evaluation
4.4.1 Synthesis Results
4.4.2 Simulation Results
4.4.2.1 Influence of different link partitioning on success rate
4.4.2.2 Evaluation of different link partitioning under certain background traffic
4.5 Summary
5 Router Design
5.1 System Model
5.2 Proposed Router Architecture
5.2.1 Router architecture
5.2.2 Packet-switching Part
5.2.3 Circuit-switching Part
5.3 Synthesis Results
5.4 Summary
6 Conclusions and Future Work
Bibliography
List of Symbols
S      slot table size
T      the total link capacity
t      a specific time slot
R      required bandwidth
H      route time over a single hop
m      the path length from source to destination
B      branch metric
a      the available slots of the branch
r      the number of requested slots
w      the weight of the branch
P      path metric
S_j    the set of states that have transitions to state j
l      the distance between source and destination
d      the number of allowed detours
M      the number of routers of the whole network
N      the side length of the network, hence the network is N × N
T_eff  the effective time
E      the error rate
C      the number of sub-channels
List of Abbreviations
ATM Asynchronous Transfer Mode
Ans Answer signal
Ack Acknowledgment
AT Area.Time product
AT/S Area.Time/Success Rate measure
BE Best Effort
bk background traffic
CS Circuit Switching
Des destination
DSS Detect-Select-Shift
DS Detect-Select
EoP End-Of-Packet
FB TESSA Forward-Backtrack TrElliS-Search based Allocation
GS Guaranteed Services
HAGAR HArdware Graph ARray
HPU Header Parsing Unit
IP Intellectual Property
N Node
NoC Network-on-Chip
NoCM NoCManager
NI Network Interface
PE Processing Element
RE TESSA Register-Exchange TrElliS-Search based Allocation
R Router
SDM Space-Division-Multiplexing
SoC System on Chip
Src source
TDM Time-Division Multiplexing
TESSA TrElliS-Search based Allocation
2D Two-dimensional
List of Figures
1.1 Interconnection solutions.
1.2 Circuit Switching network. Each flow has its own dedicated point-to-point connection to transfer data. Src: source node, Des: destination node.
1.3 General operating procedure of Circuit Switching.
2.1 Src tries to find a contention-free path to the Des. As the detour is allowed, during the path search, at each node there are up to 4 directions to go. Src: source node, Des: destination node.
3.1 The classification tree of TESSA structures.
3.2 Contention-free TDM CS routing.
3.3 A connection allocation example for TDM CS. Src: source node, Des: destination node.
3.4 Proposed System model of the NoCManager based NoC platform. Src: source node, Des: destination node.
3.5 Block diagram of the NoCManager
3.6 Network graph represented by trellis graph
3.7 Multiple stages trellis graph
3.8 Each slot at the initial stage has its own layer of trellis.
3.9 Each slot searches its own path in parallel.
3.10 Communication bandwidth split over multiple paths. Src: source node, Des: destination node.
3.11 The flow chart of the path search in trellis.
3.12 a) 2x2 2D-mesh example NoC; b) schematic structure of the unfolded trellis search for the example NoC; c) schematic structure of the folded trellis search; d) schematic structure of the bidirectional trellis search. Src: source node, Des: destination node.
3.13 Implementation details of an example DSS unit of node 3
3.14 Implementation schematic of the unfolded trellis for the example NoC
3.15 Implementation schematic of the bidirectional trellis for the example NoC
3.16 Implementation schematic of the folded trellis for the example NoC
3.17 Area of different TESSA in different size NoC with different slot table size
3.18 Average AT complexity per allocation of different TESSA in different size NoC with different slot table size
3.19 Average Energy consumption per allocation of different TESSA in different size NoC with different slot table size
3.20 Allocation speed compared to Microblaze software-based approach [SNG12] with different background in 4x4 NoC with slot table size of 16.
3.21 Success Rate compared to single path solutions in 4x4 NoC with different background traffic with Slot Table Size of 16.
3.22 Success Rate compared to single path approach in 8x8 NoC with different background traffic with Slot Table Size of 16.
3.23 Allocation speed of bidirectional TESSA compared to probe search in different networks with different GS offered load with Slot Table Size of 16.
3.24 Success Rate of Bidirectional TESSA compared to probe search in different networks. Each connection delivers 200 flits.
3.25 Success Rate compared to probe search in 6x6 and 8x8 mesh networks with 8 or 16 slot table size. Each connection delivers 100 or 500 flits.
3.26 Success Rate influence of split link into different slots in 6x6 and 8x8 networks. Each connection delivers 200 or 500 flits.
3.27 Success rate under different hops of allowed detours in 6x6 network with 8 or 16 slot table size.
3.28 Success rate under different hops of allowed detours in 8x8 network with 8 or 16 slot table size.
3.29 2x2 2D-mesh example NoC.
3.30 Unfolded RE trellis path search. The survivor path is read directly from destination node without backtrack.
3.31 The folded RE trellis search graph of the example NoC.
3.32 Block diagram of single state.
3.33 Implementation schematic of the folded RE trellis
3.34 Area of folded FB and folded RE NoCManagers in different size NoCs with different slot table sizes.
3.35 Average AT complexity per allocation of RE and FB NoCManagers in different size NoCs with different slot table sizes.
3.36 Average Energy consumption per allocation of RE and FB NoCManagers in different size NoCs with different slot table sizes.
3.37 Allocation speed comparison between RE TESSA and FB TESSA.
3.38 Success Rate compared to FB TESSA and probe search in 6x6 network with Slot Table Size of 16 and 8. Each connection delivers 100 or 200 flits.
3.39 Success Rate compared to FB TESSA and probe search in 8x8 network with Slot Table Size of 16. Each connection delivers 300 or 500 flits.
3.40 Average Area.Time/Success Rate (per allocation) in 6x6 network with Slot Table Size of 16 and 8. Each connection delivers 100 or 200 flits.
3.41 2x2 example NoC.
3.42 Multiple-layer trellis of the example NoC. Each slot at the initial stage has its own layer.
3.43 General schematic structure of the single-layer approach. The slot table size is S, so there are S-1 additional stages. The first S stages are associated with S time slots, which can launch S initial searches simultaneously at the source node.
3.44 Single-layer path search example.
3.45 The implementation schematic of node 3 at stage 1, so the associated time slot is slot 1.
3.46 Area of unfolded single layer and multi-layer NoCManagers in different size NoCs with different slot table sizes.
3.47 Average Energy consumption per allocation of unfolded single layer and multi-layer NoCManagers in different size NoCs with different slot table sizes.
3.48 The original NoC is divided into 4 partitions with 4 dedicated NoCMs. The green arrow: forward search, purple arrow: backtrack, and the red arrow: communication among NoCMs. The border nodes A and E are backtracked as survivor path. The cross-partition search is along NoCM_A → (NoCM_B, NoCM_C) → NoCM_D. The path search inside each partition is done as forward-backtrack trellis search, while cross-partition search among NoCMs is as probe search.
3.49 The probe search among NoCMs. Each node represents a NoCM.
3.50 Block diagram of the NoCManager
3.51 At the beginning, NoCM A is free. At state 1, probe search comes, NoCM A becomes busy. At state 2, Nack comes. At state 3, Ack comes.
3.52 The search procedure of partitioned TESSA.
3.53 Total area of non-partitioned and partitioned NoCMs in different size NoC with different slot table size. The x-axis label ‘16, 4slot’ indicates 16x16 mesh with 4 slots.
3.54 Comparison of allocation speed of partitioned and non-partitioned TESSA and probe search in different network with Slot Table Size of 8.
3.55 Comparison of Success Rate of partitioned and non-partitioned TESSA and probe search in 16x16 and 20x20 networks with Slot Table Size of 8.
3.56 Comparison of Success Rate of partitioned and non-partitioned TESSA and probe search in 18x18 network with Slot Table Size of 4 or 8.
3.57 The boundary of request injection rate under the NoCM’s capacity in different NoC sizes.
4.1 Connection allocation in TDM CS and SDM CS.
4.2 Combined TDM-SDM routers with each link split into 2 sub-channels and 3 time slots. Along the path, if time slot 1 is reserved at a sub-channel in router R1, slot 2 at any of the two sub-channels can be reserved in R2.
4.3 System Model of the NoCManager based NoC platform. Src: source node, Des: destination node.
4.4 Example NoC graph. Each link has two sub-channels. A node can reach itself (curve arrow).
4.5 The example NoC can be represented by trellis. Assume each link has 2 sub-channels (C1 and C2). The green dotted arrow indicates the search of this edge failed because it is not available at the moment (already occupied). Src: source node, Des: destination node.
4.6 Implementation schematic of the bidirectional trellis graph
4.7 Implementation details of a DS unit with 2 sub-channels for single slot of node 0
4.8 Area of (unidirectional) TESSA and Bidirectional TESSA NoCManagers with different link partitioning for different NoC sizes.
4.9 Average AT complexity per allocation of two different NoCManagers with different link partitioning for different NoC sizes.
4.10 Average energy consumption per allocation of two different NoCManagers with different link partitioning for different NoC sizes.
4.11 Influence of different link partitioning on success rate in 6x6 mesh.
4.12 Influence of different link partitioning on success rate under background traffic.
4.13 Influence of different link partitioning on average path length under background traffic.
4.14 Influence of different link partitioning on average delivery latency under background traffic.
5.1 System Model of the NoCManager based NoC platform.
5.2 The proposed router architecture.
5.3 Block diagram of the proposed combined BE-GS router with 4 ports. As an example a GS connection (green arrow) from port south to north is established and the GS flits from south to north are directly forwarded.
5.4 GS flits.
List of Tables
1.1 The interconnections advantages and disadvantages
1.2 Comparison between Circuit Switching vs. Packet Switching networks
3.1 The comparison of different TESSA structures.
3.2 The usage of control signals
5.1 The control signals of GS phit
5.2 Resource Consumption of the Proposed Router with 65 nm technology
Chapter 1
Introduction
Due to advances in semiconductor technology, transistor sizes are decreasing while
die sizes are increasing. As a result, a single chip can nowadays contain billions of
transistors, so more devices and Intellectual Property (IP) cores can be integrated on a
single chip. Consequently, the on-chip interconnection (communication) problem, i.e. how
to connect the on-chip modules, is becoming a critical issue for complex System on Chip
(SoC) design [WL03].
1.1 Overview of on-chip interconnection solutions
The on-chip interconnection connects the modules of the on-chip system, e.g.
processors, memories, and peripherals. A typical interconnection diagram is
illustrated in Fig. 1.1a.
The bus is a widely used interconnect. It is simple and cost-effective, with a hardware
complexity of O(n), but its performance typically scales only to a small number of nodes.
All modules on a chip are connected to a shared bus, and only one master module can
access it at a time, as depicted in Fig. 1.1b. The data sent by a module is broadcast
to all other modules. This induces contention among competing modules, which is resolved
through arbitration that determines which module gains access to the bus at a given time.
The problem of a bus-based system is its lack of scalability. When the number of connected
modules increases, the average bandwidth per module decreases while the contention rate
increases. In addition, as the wire length and capacitance increase, the wiring delay
increases, which reduces the operating frequency.
In order to enable simultaneous communication among multiple modules, the crossbar
switch has been proposed, as depicted in Fig. 1.1c. In a crossbar, every module is connected
to every other module. The communication bandwidth is enhanced since multiple
communications can be supported simultaneously as long as they occur between different
modules. Though a crossbar can provide low latency and high throughput, it is expensive.
Figure 1.1: Interconnection solutions. (a) On-chip interconnection diagram; (b) Bus; (c) Crossbar; (d) Mesh NoC.
The hardware complexity of a crossbar scales with O(n^2), where n is the number of
connected modules. Hence, it is suitable for a small number of modules, but too expensive
for large systems.
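The scaling argument for the bus and the crossbar can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from this thesis: the bus bandwidth is shared equally among n competing modules, and an n x n crossbar needs one switch point per input-output pair. The function names are hypothetical.

```python
def bus_bandwidth_per_module(total_bandwidth: float, n: int) -> float:
    """Average share of one shared bus when n modules compete equally (assumed model)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return total_bandwidth / n


def crossbar_crosspoints(n: int) -> int:
    """Switch points of an n x n crossbar: one per (input, output) pair, i.e. O(n^2)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n * n


if __name__ == "__main__":
    # Normalized bus bandwidth of 1.0; only the trend matters here.
    for n in (4, 16, 64):
        share = bus_bandwidth_per_module(1.0, n)
        points = crossbar_crosspoints(n)
        print(f"n={n:3d}  bus share={share:.4f}  crossbar crosspoints={points}")
```

Under these assumptions, quadrupling the number of modules cuts the bus share per module to a quarter and multiplies the crossbar cost by sixteen.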
In order to address the aforementioned problems, an advanced interconnection solution,
the Network-on-Chip (NoC), has been proposed [DT04, BDM02, GHKM11]. The NoC borrows
ideas from large-scale communication networks, e.g. the Internet or Asynchronous Transfer
Mode (ATM) networks, to create a scalable on-chip communication network. Each Processing
Element (PE), or module, is connected to a router via a Network Interface (NI). Data
transfer in point-to-point communication is controlled and relayed by routers. Hence,
long wires between a source node and a destination are avoided, and a data packet may
take multiple hops through the network to reach the destination. A popular NoC topology,
the mesh, is depicted in Fig. 1.1d. Since each router only needs to connect to its adjacent
routers, the wires are short. Concurrent communications are possible as long as they
are delivered over different links. Since a mesh has a regular structure and equal-length
links, it is easy to lay out. Moreover, it has high path diversity, i.e. there are many
alternative paths from one node to another. Therefore, the NoC can provide large
communication bandwidth, low wiring complexity and good scalability. The advantages
and disadvantages of the aforementioned interconnections are shown in Table 1.1 [Ste12].

Table 1.1: The interconnections advantages and disadvantages
                     Bus     Crossbar   NoC
scaling              10      10-100     10+
wire cost            low     high       average
logic/buffer cost    low     high       average
throughput           low     highest    high
energy efficiency    poor    average    good
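The path diversity of a mesh can be made concrete with a small sketch. It counts only the minimal (shortest) paths between two nodes of a 2D mesh, which is a standard combinatorial result rather than a figure from this thesis; paths with detours would add further alternatives. The coordinate convention and the function name are assumptions made for illustration.

```python
from math import comb


def minimal_paths(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Number of shortest paths between two nodes of a 2D mesh.

    A shortest path takes dx hops in x and dy hops in y in any order,
    so there are C(dx + dy, dx) distinct hop sequences.
    """
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return comb(dx + dy, dx)


if __name__ == "__main__":
    # Corner-to-corner in a 4x4 mesh: 3 hops in each dimension.
    print(minimal_paths((0, 0), (3, 3)))
    # Nodes in the same row have exactly one shortest path.
    print(minimal_paths((0, 0), (3, 0)))
```

Even for a small 4x4 mesh the corner-to-corner pair already has 20 shortest paths, whereas a bus offers a single shared medium.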
1.2 Network on Chip
The following terminology is used for NoCs:
1. Router connects a fixed number of input links to a fixed number of output links and
forwards each incoming packet to a specific output according to a predefined routing
policy.
2. Network Interface (NI) connects the PE to the network and decouples communication
from computation.
3. Link is a bundle of wires that carries communication/data signals between two
adjacent routers.
4. Channel is a single logical connection between two adjacent routers.
5. Network node is a logical abstraction of a router within a network.
6. Traffic flow is a sequence of packets from a source node to a destination.
7. Message is the transfer entity from the network's clients to the network.
8. Packet is the transport element of the network. A single message can be split into
one or more packets.
9. Flit, i.e. FLow control unIT, is the elementary data unit of the flow control mechanism.
Each packet is split into one or more fixed-size flits.
10. Phit is the physical digit, i.e. the data transmitted per link per cycle.
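The relation between message, packet, flit and phit defined above can be illustrated with a minimal sketch. The byte sizes, the zero padding of the last flit and the function names are assumptions made for illustration only; they are not taken from a specific NoC.

```python
def chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def segment(message: bytes, packet_bytes: int, flit_bytes: int) -> list[list[bytes]]:
    """Return one list of fixed-size flits per packet of the message."""
    packets = chunks(message, packet_bytes)
    # Flits are fixed-size, so the last flit of a packet is zero-padded
    # (the padding scheme is an assumption of this sketch).
    return [[f.ljust(flit_bytes, b"\x00") for f in chunks(p, flit_bytes)]
            for p in packets]


def phits_per_flit(flit_bytes: int, link_bytes: int) -> int:
    """Number of link cycles (phits) needed to transfer one flit."""
    return -(-flit_bytes // link_bytes)  # ceiling division


if __name__ == "__main__":
    packets = segment(b"x" * 10, packet_bytes=8, flit_bytes=4)
    print([len(p) for p in packets])  # number of flits in each packet
    print(phits_per_flit(4, 2))       # link cycles per flit
```

With these hypothetical sizes, a 10-byte message becomes two packets, the first carrying two flits and the second carrying one padded flit.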
The basic characteristics of NoC are:
1. Topology: The network topology is the pattern in which network nodes are connected
via links, i.e. the arrangement of these nodes and links. A good topology
should meet the bandwidth and throughput requirements of the system at minimum
cost. The path diversity of a topology also determines the performance under
adversarial traffic and the fault tolerance. The most popular network topology is the
mesh, which has a regular structure and equal-length links, so it is easy to lay out, as
shown in Fig. 1.1d.
2. Routing is the process of selecting a path for traffic from the source node to the
destination node. Once a topology is selected, the routing algorithm determines the system
throughput. A good routing algorithm should be deadlock-free, balance the traffic
load, and typically keep the path lengths as short as possible.
3. Flow control dictates which messages get access to particular network resources
over time. It manages the allocation of channel bandwidth and buffer capacity to
packets along the path from source to destination. Good flow control minimizes the
packet delivery latency and avoids resource conflicts and buffer overflow.
4. Switching model: Generally speaking, NoCs can be grouped into two categories:
packet switching and circuit switching networks. In packet switching
networks, the communication resources, e.g. links and buffers, are allocated at each
node for each packet. When a packet reaches a node, the resources are allocated.
When the packet leaves, the resources are released. In contrast, circuit switching
networks allocate exclusive channels to form a circuit (connection) from the source node
to the destination, and then the data packets are delivered over this connection, as
depicted in Fig. 1.2. The comparison between Circuit Switching and Packet Switching
networks is listed in Table 1.2.
5. Service class: The traffic can be divided into two broad categories: guaranteed
service classes and best-effort (BE) classes. The guaranteed service classes
provide some minimum level of performance, such as a guaranteed loss rate, throughput,
latency, and jitter. In contrast, the network provides no hard guarantees for
best-effort classes.
Table 1.2: Comparison between Circuit Switching and Packet Switching networks
                            Circuit Switching               Packet Switching
Orientation                 Connection oriented             Connectionless
Connection Established      Yes                             No
Resource Allocation         Resources are allocated         Not required.
                            before data transfer.
Communication Reliability   High                            Unreliable
Bandwidth                   Fixed                           Dynamic
Delay overheads             Connection allocation delay.    Packet transmission delay.
Data transmission delay     Low                             Unpredictable
Flexibility                 Inflexible                      Flexible
Figure 1.2: Circuit Switching network. Each flow has its own dedicated point-to-point
connection to transfer data. Src: source node, Des: destination node.
1.3 Guaranteed Service in NoCs
In modern complex SoCs, many applications have specific performance requirements,
such as a minimum throughput (for real-time streaming data) or bounded latency
(for interrupts, process synchronization, etc.). Therefore, providing Guaranteed Services
(GSs), e.g. to guarantee bounded latency, minimum bandwidth and low (or no) data
loss, is crucial for predictable SoC architectures[SMG14]. In general, there are two pop-
ular approaches to provide GS in NoCs: i) the connection-less approach employing pri-
ority scheduling in packet switching networks and ii) the connection-oriented method
based on Circuit Switching (CS) network. Priority scheduling assigns higher priority to
the latency-sensitive ﬂows, as in e.g. QNoC[BCGK04], artNoC[SLB07] and Quota-setting
NoC[CTSM13]. This approach however suﬀers from reduced determinism and predictabil-
ity due to the contention of multiple ﬂows on shared resources. In contrast, the CS allocates
exclusive channels to form a circuit (connection) to the particular ﬂow, so it can provide
hard GS, as used e.g. in MANGO[BS05]. After the connection is established, the requested
bandwidth (link capacity) and bounded/constant end-to-end latency are guaranteed. In
order to achieve hard GS, this thesis focuses on the CS approach only.
Figure 1.3: General operating procedure of Circuit Switching (connection allocation,
data transfer, connection release).
1.3.1 Circuit Switching NoCs
The general operating procedure of CS comprises three phases: connection allocation
(setup), data transfer and connection release, as depicted in Fig. 1.3. In CS, when a
source node wants to send data to a destination, a connection between the source node
and destination should be allocated ﬁrst. Thereafter, the data packets are delivered over
this connection. When the data transfer is ﬁnished, the connection will be released.
In CS, since the resource is exclusively occupied during the entire lifetime of a connection,
it may lead to inefficiency and inflexibility of the system due to the blocking of other
traffic flows. Hence, there are two extensions to share the resource (links) among multiple
flows: 1) Time-Division Multiplexing (TDM) CS and 2) Space-Division-Multiplexing
(SDM) CS.
In TDM CS, the link capacity is split into multiple time slots, and the link is allocated
exclusively to a flow only in specific time slots, while the other time slots can be used
by other flows, as used e.g. in Nostrum[MNTJ04], AEthereal[GDR05, GH10] and the parallel
probe NoC[LJL14b].
In SDM CS[LJL15, EJ13, LMV+08, RRRM08], the link wires are physically split into
subsets (sub-channels), and only a subset of the link wires is exclusively allocated to a
given connection, while the remaining wires of the link can be used by other flows.
1.4 Scope and Outline of this Work
CS is frequently adopted for providing GSs in NoCs. In order to share the resource (links),
two extensions have been proposed: TDM CS and SDM CS. Since the data can be trans-
ferred only after the connection is allocated, the connection allocation is critical to CS
NoCs. In this thesis, we focus on the connection allocation problem, which has to i) allo-
cate a contention-free path between source-destination pairs and ii) allocate appropriate
portions of link bandwidth (appropriate number of time slots and subsets) along the path.
The dissertation is structured as follows.
• In Chapter 2, we first explain the connection allocation problem in CS
NoCs and then give an overview of related work. Finally, we present the
scheme of our allocation approach, a centralized allocation technique based on a
dedicated allocator (i.e. NoCManager).
• Chapter 3 starts by describing the problem of TDM CS connection allocation
and then presents the architecture of the dedicated allocation unit, NoCManager,
which uses dynamic programming to solve the connection allocation problem as a
trellis path search, an efficient way to solve the shortest path search problem.
Two types of trellis are considered: the Forward-Backtrack trellis[CMF16a, CMF16b]
and the Register-Exchange trellis[CMF17d]. In the Forward-Backtrack trellis, the
path search is divided into two steps: a forward search that tries to reach the
destination, and a backtrack from the destination that selects the survivor path.
In the Register-Exchange trellis, only the forward search is needed and the
backtrack step is omitted: as soon as the forward search is finished, the survivor
path is available at the destination node. Compared to the Forward-Backtrack
approach, the search time is thereby halved. Moreover, in both approaches, the
trellis search can be further grouped into three categories: unfolded
structure[CMF16a], folded structure[CMF16b] and bidirectional structure. The
unfolded structure constructs the trellis graph as multiple stages that represent
the multi-hop traversal through the network, whereas the folded structure
implements only one stage and reuses it for the whole path search. The folded
structure is more area-efficient, but it needs more clock cycles to complete a
single allocation, since traversing each stage takes one cycle. The bidirectional
structure starts the search at both the source node and the target node
simultaneously, so compared to the traditional search that starts only at the
source node, the path search time is halved. The synthesis results for area,
Area.Time product and energy consumption per allocation of the different trellis
structures are presented and compared. The allocation time and allocation success
rate of the trellis search are compared to previous centralized and distributed
allocation approaches. Finally, in order to address the scalability issue, a
partitioning structure[CMF17a] that divides a large system into multiple
partitions with multiple local managers is proposed.
• Chapter 4 starts with an introduction to SDM CS. In SDM CS, since there
is no time slot scheduling constraint and any free sub-channel at the next hop
can be allocated, it can provide higher path diversity than TDM CS. However, the
area cost of an SDM switch scales quadratically with the number of sub-channels.
Moreover, the number of sub-channels is limited by the number of wires, so it cannot
be increased arbitrarily. Hence, we present the combined TDM and SDM CS
technique[CMF17c], in which each sub-channel is further split into time slots. The
trellis path search based NoCManager is employed for the connection allocation of
combined TDM and SDM CS. We also study the influence of different link partitioning
strategies with fixed link wires, i.e. the effect of splitting the link into different
numbers of time slots and sub-channels, on allocation success rate and average path
length.
• In Chapter 5, we present the router architecture[CMF17b] that combines the circuit-
switching network (for GS communication) and packet-switching network (for BE
communication).
• Finally, we draw conclusions and provide some future research directions in Chap-
ter 6.
Chapter 2
Connection Allocation in CS NoCs
This chapter first explains the connection allocation problem in CS NoCs
and then gives an overview of known methods and approaches (related work). Finally,
we introduce the idea of our allocation approach, which is based on a centralized dedicated
allocator (i.e. NoCManager).
2.1 Connection allocation problem
Some real-time applications have strict timing requirements: data must be delivered across
the NoC in time. It is therefore preferable to reserve dedicated transfer resources for
the communication, i.e. to perform connection allocation. In CS, the connection allocation has
to i) allocate a contention-free path from the source node to the destination node and ii) allocate
appropriate portions of link bandwidth (an appropriate number of time slots in TDM CS and
subsets in SDM CS) along the path. Assume the total link capacity is T, and the link is
split into p portions (number of slots/channels). If an application requires bandwidth R,
then $\lceil R \div (T/p) \rceil = \lceil R \cdot p / T \rceil$ slots/channels will be assigned to it.
If the path length from source to destination is m hops, and it takes time H to route over
a single hop, then the packet delivery latency through the NoC will be m · H.
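The two formulas above can be checked with a small numeric sketch. The function names and the example values (a link capacity of 32 units split into 8 portions, a request of 10 units, 5 hops with 2 cycles per hop) are hypothetical and only serve as an illustration.

```python
import math


def portions_needed(R: float, T: float, p: int) -> int:
    """Number of slots/channels for a bandwidth request R on a link of
    total capacity T that is split into p equal portions."""
    return math.ceil(R * p / T)


def delivery_latency(m: int, H: float) -> float:
    """Packet delivery latency over a path of m hops with per-hop time H."""
    return m * H


if __name__ == "__main__":
    # Each portion carries 32 / 8 = 4 units, so a request of 10 units
    # needs ceil(10 / 4) portions.
    print(portions_needed(R=10, T=32, p=8))
    print(delivery_latency(m=5, H=2))
```

Rounding up reflects that a portion cannot be shared between connections: a request slightly above two portions still occupies three.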
The connection allocation is critical to circuit switching since the data transfer relies
on the allocated connections. If the connection allocation fails, the data transfer cannot
be started at all. If we allocate a short path, we can reduce delivery latency, energy
consumption and resource utilization.
In order to minimize the packet delivery latency and resource cost, the goal is usually to
select the shortest among all contention-free paths from source to destination.
This path search problem has exponential complexity in the path length, and the exact
complexity function depends on the network topology, more particularly on the path
diversity and length parameters [Ste12]. For a mesh network, if detours are allowed, there
are up to 4 directions to go at each hop, as shown in Fig. 2.1. Let l be the distance
(i.e. the hop count) between source and destination and d the number of allowed detours;
then the rough complexity function is $O(4^{l+d})$, where 4 is the path diversity and
(l + d) is the length component.
Figure 2.1: Src tries to find a contention-free path to the Des. As detours are
allowed, there are up to 4 directions to go at each node during the path search.
Src: source node, Des: destination node.
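To give a feeling for the exponential bound, the following sketch counts the hop sequences of a given length that start at a source node of a mesh. It is only an illustration under simplifying assumptions: contention is ignored, nodes may be revisited, and the function name and mesh size are hypothetical.

```python
def count_walks(width: int, height: int, src: tuple[int, int], hops: int) -> int:
    """Count all hop sequences of length `hops` starting at `src` in a
    width x height mesh (up to 4 directions per hop, revisits allowed)."""
    counts = {src: 1}  # node -> number of walks ending at that node
    for _ in range(hops):
        nxt: dict[tuple[int, int], int] = {}
        for (x, y), c in counts.items():
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    nxt[(nx, ny)] = nxt.get((nx, ny), 0) + c
        counts = nxt
    return sum(counts.values())


if __name__ == "__main__":
    for n in range(1, 6):
        # The count never exceeds the bound 4**n; mesh borders reduce it.
        print(n, count_walks(5, 5, (2, 2), n), 4 ** n)
```

An exhaustive search that enumerates such sequences one by one therefore becomes impractical quickly as l + d grows.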
2.2 Related work
The allocation techniques can be grouped into two categories: i) static (design-time)
allocation and ii) dynamic (run-time) allocation.
The static allocation [LJ08, SBSK12, KS14, SKS13, SG11a, MGK14] is done at the design
(compile) time of the system, and usually complicated allocation algorithms are adopted.
The communication patterns and connection requirements are assumed to be known
at design time, and the allocation cannot be changed according to the applications'
dynamic requirements at run time. Moreover, since static allocation cannot use
knowledge of the real-time network traffic, the resource utilization is usually
sub-optimal. Consequently, static techniques are not well suited to dynamic systems.
In dynamic connection allocation techniques, connections are allocated at run time
according to the applications' real-time requirements and the real-time network
state. They can be further divided into two categories: i) centralized [SNG12, MMB07,
MBD+05, WF08, WF11, SG11b, PMM15, HCG07, HG07] and ii) distributed allocation
[LJL14b, GDR05, LJL12, LL12a, LL12b, LL11, Hei14].
2.2.1 Distributed allocation techniques
In a distributed allocation, the source node typically sends a setup signal that traverses
the NoC in search of a path to the destination node. The search
signal can be delivered over a dedicated setup network or over the normal NoC, based on some
routing algorithm. The resource is reserved by the search signal hop by hop. When the
search signal reaches the destination, the search has succeeded and an acknowledgment
signal is sent back to the source node. When the source node receives the acknowledgment,
the connection is established and the data transfer starts. Distributed
allocation scales well, but it lacks global knowledge: if there are several concurrent
requests, the corresponding searches might block each other, especially under heavy
traffic load and high connection request rate. When an allocation fails, an additional
mechanism is needed to tear down the partially set up path. Moreover, since the setup
search operates at the relatively low clock frequency of the network (compared to a
high-speed central manager), the setup latency is increased. Furthermore, distributed
approaches are usually constrained to searching minimal paths, which limits the path diversity.
In [LL12a, LL12b, LL11], when a node needs a connection to another node in the network,
it sends a best-effort setup packet, which is routed to the destination by the
deterministic XY routing algorithm and reserves a channel in each router it crosses along the
path. When the setup packet reaches its destination, an ACK packet is generated. Upon
reception of the ACK, the source starts transferring data. The problem of this approach
is that the setup latency is not guaranteed because of the unpredictable contention of
best-effort packets. Moreover, deterministic XY routing can only search one fixed path
between the source node and the destination node without exploring other possible paths, which
seriously limits the success probability of the connection setup.
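The hop-by-hop reservation along a single XY path, and why it fails as soon as one link on that path is busy, can be sketched as follows. This is a simplified model and not the implementation of the cited works: links are directed node pairs, each link offers one channel, and all names are assumptions.

```python
Node = tuple[int, int]
Link = tuple[Node, Node]


def xy_route(src: Node, dst: Node) -> list[Node]:
    """Nodes visited by XY routing: first along X, then along Y."""
    (x, y), (tx, ty) = src, dst
    path = [(x, y)]
    while x != tx:
        x += 1 if tx > x else -1
        path.append((x, y))
    while y != ty:
        y += 1 if ty > y else -1
        path.append((x, y))
    return path


def setup_connection(src: Node, dst: Node, reserved: set[Link]) -> bool:
    """Reserve every link of the XY path; roll back on the first busy link."""
    path = xy_route(src, dst)
    taken: list[Link] = []
    for link in zip(path, path[1:]):
        if link in reserved:
            for l in taken:          # tear down the partial setup path
                reserved.discard(l)
            return False
        reserved.add(link)
        taken.append(link)
    return True


if __name__ == "__main__":
    reserved: set[Link] = set()
    print(setup_connection((0, 0), (2, 1), reserved))  # first request
    # The second request shares the link (0,0)->(1,0) with the first one
    # and is rejected, although a detour via (0,1) would be free.
    print(setup_connection((0, 0), (2, 2), reserved))
```

The second request illustrates the limitation described above: only one fixed path is tried, so free alternative paths are never considered.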
In parallel probe search [LJL14b, LJL12, Sha15], the source node sends a setup packet
that traverses the NoC along all minimal paths in search of a path to the
target node. It is a flood-based algorithm that eliminates redundant incoming paths. The
probe search in [LJL12] is proposed for basic CS without link sharing. The probe search
in [LJL14b] is an extended version for TDM CS, with a double time-wheel technique used
to make the backtrack efficient and to guarantee the setup delay. In this approach, several
trials might be needed before success because the method investigates a single slot
at a time. For instance, if the link is split into S time slots, it has to search S times in the
worst case to find the path or to determine that no path is available. Furthermore, this
approach enables single-slot allocation, but the problem of multi-slot allocation was not
addressed in recent work. Moreover, since the search packets have to flood through the
NoC and are routed via relatively complex routers, this costs more energy than dedicated
centralized systems.
Virtual-channel based distributed allocation is proposed in [Hei14], which provides different
levels of throughput and latency guarantees for point-to-point connections.
Weighted round-robin scheduling is used for arbitration. Different service levels for
communication are assigned to different applications based on their requirements. End-to-end
connections from one node to another are reserved as a chain of virtual channels. This
method can provide high throughput, but since each flow requires exclusive virtual channels,
the area cost is high. Furthermore, low setup latency is not guaranteed.
2.2.2 Centralized allocation techniques
In a centralized system, a central manager is responsible for connection allocation. Since
the central manager has global knowledge of the system, it can achieve globally optimal
results. Centralized systems are typically based on software solutions. For example, the
authors of [SNG12] use a Microblaze processor, while an ARM processor is employed in
[SBSK12, MMB07]. Software solutions provide excellent flexibility; however, they might
suffer from relatively long allocation times. For instance, the exhaustive single-path search
in [SNG12] tries to add a link to the current path if the link provides sufficient slots and
leads closer to the destination. If all links of the current node fail, it rolls back to the
previous node and tries another direction. Due to the sequential investigation of a single
link at a time, and the allocation of all required slots on a single path, thousands of
processor cycles are required for a single allocation.
In order to increase the allocation speed, the HArdware Graph ARray (HAGAR) approaches
[WF08, WF11] proposed a dedicated hardware connection allocator, which can speed up
the allocation by two orders of magnitude compared to software methods. In HAGAR, the
connection allocation problem is solved as a shortest path problem in a graph representation
of the NoC. However, HAGAR is designed for basic CS and does not support
link sharing techniques such as TDM and SDM. In [PMM15], a centralized hardware
unit that uses a breadth-first path search algorithm was proposed with excellent
performance. However, it is restricted to minimal paths, i.e. it only considers links
that reduce the distance to the destination, so it cannot detour when no
minimal path is available. Moreover, in both the HAGAR and the breadth-first search
approaches, the path search is divided into two steps:
• Forward search: first, a forward search starts at the source node and traverses
the network to find a path.
• Backtracking: second, when the destination node is reached, a backtrack starts
from the destination to collect the associated survivor path parameters.
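The two-step procedure listed above can be sketched in software as a breadth-first forward search that records one predecessor per node, followed by a backtrack from the destination. This is a generic illustration and not the hardware design of HAGAR or [PMM15]; the adjacency dictionary of free links and all names are assumptions.

```python
from collections import deque


def forward_search(free_links: dict[str, list[str]], src: str, dst: str) -> dict:
    """Step 1: breadth-first traversal from src over free links.
    Returns a map from each reached node to its predecessor."""
    pred = {src: None}
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            break
        for nxt in free_links.get(node, ()):
            if nxt not in pred:       # keep only the first (shortest) arrival
                pred[nxt] = node
                frontier.append(nxt)
    return pred


def backtrack(pred: dict, dst: str):
    """Step 2: follow the predecessors from dst back to the source."""
    if dst not in pred:
        return None                   # destination was never reached
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = pred[node]
    return path[::-1]


if __name__ == "__main__":
    links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(backtrack(forward_search(links, "A", "D"), "D"))
```

The path is only known after the second pass, which is the cost that a single-pass search avoids.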
Although the centralized system has the advantages of global knowledge and high
performance, the central unit might become a bottleneck as the network grows and the
allocation request rate increases, because both computation and communication are
centralized.
The motivation of this work is to address these problems with a dedicated allocator for
connection allocation that employs a novel trellis-search based allocation algorithm for TDM
and SDM CS NoCs. The Register-Exchange technique is adopted to merge the forward search
and the backtrack into a single step to increase the allocation speed, and the partitioning
structure is proposed to improve the scalability of the centralized system.
2.3 Trellis Search based Allocation approach
In this thesis, we propose a dynamic allocation method using a dedicated centralized
hardware unit called 'NoCManager', which solves the connection allocation problem by
employing a trellis-search based shortest path algorithm. We call this the TrElliS Search
based Allocation (TESSA) approach. The aforementioned shortest path search problem
has exponential complexity in the path length. However, it can be solved efficiently
by the dynamic programming optimization approach, which transforms the complex problem
into a sequence of simpler problems that are solved stage by stage with linear
computational complexity [BHM77, Lou95]. Moreover, the dynamic programming formulation
of the path search can be implemented efficiently as a trellis search [LKFF12]. The
trellis search can explore all possible paths between two given nodes within a guaranteed
low latency and ensures that the path found is the contention-free shortest path. The
details of the trellis search for TDM CS are presented in Chapter 3, and the trellis search
for SDM CS is presented in Chapter 4.
Chapter 3
Centralized Connection Allocation
for TDM CS NoCs
This chapter presents the trellis path search algorithm for TDM CS connection allocation.
In this thesis, we propose two different approaches, the Forward-Backtrack (FB)
trellis[CMF16a, CMF16b] and the Register-Exchange (RE) trellis[CMF17d]. The
Register-Exchange technique saves the entire path information during the forward search.
Compared to forward-backtrack approaches, where a backward phase is required to
build the path after the forward search, the allocation time is thus halved. Moreover,
in both approaches, the trellis graph falls into three categories: unfolded
structure[CMF16a], folded structure[CMF16b] and bidirectional structure [CMMF]. The
bidirectional unfolded structure provides high allocation speed, while the folded structure is
more efficient in terms of hardware, so the structures suit different scenarios depending
on the requirements. Furthermore, a single-layer approach is proposed, which only
needs to implement one layer of the trellis graph; all slots can be searched simultaneously
in this single layer. Compared to previous approaches, in which multiple layers of the
trellis have to be implemented, the hardware resource consumption is reduced dramatically.
Finally, the partitioning structure[CMF17a] is proposed to address the scalability
issue. The different categories of TESSA structures are shown in Fig. 3.1.
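The Register-Exchange idea of carrying the entire path during the forward search can be sketched in software as follows. This is a behavioral illustration only, not the hardware architecture of the thesis; slots are ignored, and the adjacency dictionary of free links, the `max_hops` limit and all names are assumptions.

```python
def register_exchange_search(free_links: dict[str, list[str]],
                             src: str, dst: str, max_hops: int):
    """Stage-by-stage forward search in which every reached node keeps a
    'register' holding the complete survivor path that leads to it."""
    registers = {src: [src]}            # node -> survivor path
    for _ in range(max_hops):           # one iteration per trellis stage
        if dst in registers:
            break
        new_registers = dict(registers)
        for node, path in registers.items():
            for nxt in free_links.get(node, ()):
                if nxt not in new_registers:
                    # Copy the predecessor's register and append the new hop.
                    new_registers[nxt] = path + [nxt]
        if len(new_registers) == len(registers):
            break                       # no new node reached: search is stuck
        registers = new_registers
    # The path is read directly at the destination; no backtrack is needed.
    return registers.get(dst)


if __name__ == "__main__":
    links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(register_exchange_search(links, "A", "D", max_hops=4))
```

Compared to a forward search followed by a backtrack, the second pass disappears at the price of storing a full path per node instead of a single predecessor.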
3.1 Introduction of TDM CS
In TDM CS NoCs [YZSZ14], the link capacity is split into multiple time slots to be
shared by multiple flows. The allocation information is stored in the slot allocation tables
of each router, with one table for each shared resource (a link). The allocation tables
are synchronized such that a flow with slot t allocated at a specific router gets slot
(t + 1) mod S at the neighboring router of the next hop, where S is the number of slots in
the slot table [GEEK11].
Figure 3.1: The classification tree of TESSA structures. (Tree levels: FB TESSA or
RE TESSA; unfolded or folded; bidirectional or unidirectional; single layer or multiple
layer; partitioned or non-partitioned.)
Fig. 3.2 illustrates TDM routing with a router network and
its corresponding slot tables. In this case, each link bandwidth is split into four time
slots, and thus can be shared simultaneously by at most four diﬀerent ﬂows. The network
contains four routers, R0, R1, R2, and R3. The three arrows labeled a, b and c represent
ﬂows. Router R0 switches ﬂow a from input port i3 to output port O1 at time slot 0, as
slot table 0 indicates. Similarly, R1 switches ﬂow a from input i3 to output O2 at slot
1, as slot table 1 indicates. At slot 2, R2 switches ﬂow a from i0 to O2. Hence, ﬂow a
travels along the path R0 → R1 → R2 with the slot sequence {0, 1, 2}. TDM CS routing is
contention-free because at most one input port is connected to each output port in any
single time slot.
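The slot rule (t + 1) mod S and the contention check on the slot tables can be sketched as follows, using flow a from the example above. The table layout (one list of S entries per router output port) and the function names are assumptions of this sketch.

```python
def slot_sequence(start_slot: int, hops: int, S: int) -> list[int]:
    """Slots used at successive routers: the slot advances by one per hop."""
    return [(start_slot + h) % S for h in range(hops)]


def reserve(tables: dict, path_ports: list[tuple[str, str, str]],
            start_slot: int, S: int):
    """Reserve a flow in the slot tables.
    tables:     (router, output port) -> list of S entries (input port or None)
    path_ports: (router, input port, output port) for every hop
    Returns the slot sequence, or None if any required entry is occupied."""
    slots = slot_sequence(start_slot, len(path_ports), S)
    entries = [((r, o), t, i) for (r, i, o), t in zip(path_ports, slots)]
    for key, t, _ in entries:                  # check before writing anything
        if tables.setdefault(key, [None] * S)[t] is not None:
            return None
    for key, t, i in entries:
        tables[key][t] = i
    return slots


if __name__ == "__main__":
    tables: dict = {}
    flow_a = [("R0", "i3", "O1"), ("R1", "i3", "O2"), ("R2", "i0", "O2")]
    print(reserve(tables, flow_a, start_slot=0, S=4))  # slots of flow a
    print(reserve(tables, flow_a, start_slot=0, S=4))  # same slots: rejected
    print(reserve(tables, flow_a, start_slot=1, S=4))  # shifted by one slot
```

Because each table entry holds at most one input port, two flows can share an output port only in different slots.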
Figure 3.2: Contention-free TDM CS routing.
In TDM CS NoCs, the latency is guaranteed by allocating the contention-free shortest
path, and the bandwidth is guaranteed by the number of slots allocated to the traffic
flow. For example, if a traffic flow requires half of the link bandwidth and the size of the
slot table is four, two slots will be assigned to that flow.
3.1.1 Connection allocation in TDM CS
In connection-oriented TDM CS communication, the connection allocation has to find a
contention-free path from the source node to the destination and allocate slots along the
path. An example of connection allocation is shown in Fig. 3.3. The source node sends out
search flits that try to reach the destination node. After each hop, the number of available
slots on the path may decrease. At some nodes, there may be no available slots at all, and
those nodes discard the search flit, as shown in Fig. 3.3 at node 2. When the destination is
reached, the backtrack starts from the destination to select the path and time slots.
Consequently, the path from source to destination is selected as Src → N0 → N1 → Des,
and the slot sequence along the path is {1, 2, 3}.
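How the set of usable slots shrinks hop by hop can be sketched for a single candidate path. The free-slot sets below are invented for illustration and are not the contents of Fig. 3.3; the function name is hypothetical, and the one-slot shift per hop follows the rule from section 3.1.

```python
def feasible_start_slots(free_slots_per_link: list[set[int]], S: int) -> set[int]:
    """Start slots s for which slot (s + h) mod S is free on the h-th link
    of the path, for every hop h."""
    candidates = set(range(S))
    for h, free in enumerate(free_slots_per_link):
        candidates = {s for s in candidates if (s + h) % S in free}
        if not candidates:
            break   # no usable slot is left: the search flit is discarded
    return candidates


if __name__ == "__main__":
    S = 4
    # Hypothetical free slots on the three links of Src -> N0 -> N1 -> Des.
    path_links = [{1, 2}, {2, 3}, {3}]
    starts = feasible_start_slots(path_links, S)
    for s in sorted(starts):
        print(s, [(s + h) % S for h in range(len(path_links))])
```

With these assumed link states, only start slot 1 survives all three hops, which yields the slot sequence 1, 2, 3 along the path.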
3.2 System Model
The system model of a dedicated allocator (i.e. NoCManager) based NoC architecture is
illustrated in Fig. 3.4. The NoCManager (NoCM) attempts to allocate the appropriate
connections when it receives connection requests. In order to reduce communication cost,
the NoCM is connected to the center node of the NoC.
Figure 3.3: A connection allocation example for TDM CS. Src: source node,
Des: destination node. (Legend: forward search, backtrack, saturated link, available
slots on the path, selected slots, unavailable slots.)
There are three possible schemes for the communication between NoCM and NoC nodes,
i.e. over a dedicated configuration network, via dedicated wires, or via the
existing NoC. In [SMG14, PMM15, MTSA10, JPL08], the communication between the
NoC and the manager is via a dedicated configuration network. The dedicated network
can provide high speed, but is costly in terms of hardware resources. In HAGAR[WF08,
WF11], the central manager communicates with the NoC via dedicated wires. This can
provide high speed, but it may pose wiring difficulties for large networks in chip design.
In Lusala's work[LL11, LL12a, LL12b] and Wolkotte's work [WSRS05], the configuration
packets are delivered over the NoC as best-effort packets. The hardware cost is relatively
low, but the problem is that some resources have already been reserved by other GS flows,
so the configuration packet has to try node by node to find a free path to the target
node. Together with the unpredictable contention of best-effort packets, this means that the
connection configuration latency is not guaranteed. In order to achieve high allocation speed,
in this thesis the source node sends the connection request to NoCM via dedicated wires. In a
mesh network, for each source node, $\log_2 M$ wires are needed to indicate which node
is the destination, where M is the number of nodes in the NoC. Owing to the partitioning
structure, which divides a large system into multiple smaller logic partitions with
multiple local managers (explained in section 3.7), each NoCM only manages and connects
a limited number of nodes in its local region, so the dedicated NoCM-node wires do not
cause much overhead. The allocation information from NoCM to the source is delivered over the
NoC as a GS packet, with the associated path found by NoCM as the GS path.
Figure 3.4: Proposed system model of the NoCManager based NoC platform. (a) System
model of the NoC. (b) Connection request processing procedure diagram. (c) Block
diagram of the request processing procedure in NoCM; the 'link state memory' stores the
current state of links. Src: source node, Des: destination node.
In NoCM, the connection allocation comprises three steps: receive the connection request
from the source node, find the required contention-free path and send the allocation
information back to the source node. The complete procedure for connection allocation is as
follows:
1. The source node sends the connection request to NoCM. These requests are buffered
in a request queue in NoCM.
2. NoCM processes the requests within the request queue, searching for the best
contention-free path between the source node and the destination.
3. NoCM sends the allocation information (i.e. the information about the best path
found) to the source node in the case of success, or retries later if the search fails. A
failed request is retried only if the retry does not exceed the deadline and there are no
unprocessed requests waiting; otherwise it is discarded.
4. After receiving the allocation information, the source node starts to transmit data
along the allocated connection. The implementation details are presented in Chapter 5.
5. After the data transfer is finished, the source node deletes the allocated connection
and also informs the NoCM to free the corresponding allocated resources. The release
information is sent to NoCM as a BE packet.
Figure 3.5: Block diagram of the NoCManager. (Blocks: GS request queue, retry queue,
link state memory, TESSA unit with trellis path search, path deallocation.)
The allocation information contains the connection information that indicates the
specific time slots in which the data is inserted into the network at the source node, and,
for each hop, the output port to take (3 bits for the five directions in a
2-D mesh network: east, west, north, south and local).
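One possible way to pack the per-hop output ports into 3-bit fields is sketched below. The numeric port codes and the bit layout (hop 0 in the least significant bits) are assumptions made for illustration; the text only states that 3 bits per hop are used.

```python
# Hypothetical 3-bit codes for the five output directions of a 2-D mesh router.
PORT_CODE = {"local": 0, "east": 1, "west": 2, "north": 3, "south": 4}
PORT_NAME = {code: name for name, code in PORT_CODE.items()}


def encode_route(ports: list[str]) -> int:
    """Pack one 3-bit output-port code per hop into a single integer."""
    word = 0
    for hop, port in enumerate(ports):
        word |= PORT_CODE[port] << (3 * hop)
    return word


def decode_route(word: int, hops: int) -> list[str]:
    """Recover the output port for each hop from the packed word."""
    return [PORT_NAME[(word >> (3 * hop)) & 0b111] for hop in range(hops)]


if __name__ == "__main__":
    route = ["east", "east", "north", "local"]
    word = encode_route(route)
    print(bin(word), decode_route(word, len(route)))
```

Three bits suffice because five directions need at least three bits to be distinguished; three of the eight codes remain unused in this sketch.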
Step 2 is the main contribution of this thesis and is explained in detail in the following
sections.
3.3 Connection Allocator Architecture
The NoCManager solves the connection allocation problem as a shortest path problem
in a trellis graph description of the NoC, in order to find the shortest contention-free path
and allocate slots between two given nodes. A block diagram of the NoCM is shown in
Fig. 3.5; it comprises a trellis path search module and a link state memory. NoCM collects
and processes the incoming connection requests within the request queue. The resulting
allocation parameters are sent through the NoC to the respective source node.
3.3.1 Formalizing The Trellis Graph Structure
The aforementioned shortest path search problem can be solved efficiently by dynamic
programming, which transforms a complex problem into a sequence of simpler problems and
solves them step by step. Moreover, the path search, which consists of successive NoC
traversals from the source to the destination node, can be represented by a trellis graph,
similar to the one used in the well-known Viterbi algorithm. The transformation of the NoC
architecture into the associated path search trellis graph is illustrated in Fig. 3.6. The
trellis graph is a time-indexed version of the NoC graph.
Note that, although the analysis in this thesis restricts the trellis graph to the 2D-mesh
topology, it can be applied to any other topology, even in mesochronous or asynchronous
networks. Mesochronous or asynchronous TDM NoCs can use a synchronization token for
handshaking, as in aelite [HSG09].
3.3.1.1 General Model
The trellis graph has five important characteristics:
1. Stages: The network traversal is mapped to a trellis graph that is structured into
multiple stages (Fig. 3.7a). Each stage (i.e. each column of the trellis graph) contains
all nodes of the network, and stage n represents the set of network nodes reached after
n hops from the initial source node. Trellis branches (edges) correspond to node-to-node
hops over available NoC links, so moving from one stage to the next represents a single
hop through the network. The path search problem is solved sequentially, one stage at a
time. The “decision stages” are the stages that have to be traversed to make a decision;
the first stage is not counted, since it requires no decision. By default, the number of
decision stages is 2N − 2 for an N × N mesh network, which equals the longest minimal
path (i.e. the longest of all minimal paths) in the network. This number can be increased
to allow detours: if the search may take a detour of d hops, the trellis is constructed
with d + 2N − 2 decision stages. Each decision stage is a representation of the network.¹
The stages also have a time meaning, since different hops are associated with different
time slots. As in Fig. 3.6b, if the time slot associated with the initial stage (stage 0)
is t, then the time slot associated with stage n is (n + t) mod S, where S is the slot
table size. If the initial slot is slot 0, the trellis graph simplifies to Fig. 3.6c. The
decision variable in any node is the incoming branch selected as survivor path. At any
stage, we only need to know which node we are in to make subsequent decisions; these do
not depend on how we arrived at that node.
2. States: A node in the trellis graph, called a state, summarizes the knowledge needed
to make the current decision. At each stage, the decision in a particular state is made
by choosing exactly one of the active incoming branches as survivor path, i.e. the
previous node that this branch comes from is remembered as the predecessor.
¹ In the following sections, a decision stage is simply called a stage.
Figure 3.6: Network graph represented by a trellis graph. (a) Example network graph with
nodes 0 to 3; a node can reach itself (curved arrow). (b) A trellis graph representing the
network graph; stages n and n+1 carry the time slot indices (n+t) mod S and (n+1+t) mod S,
assuming the initial time slot at the initial stage is slot t. (c) Simplified trellis graph
with time slot indices n mod S and (n+1) mod S, assuming the initial time slot at the
initial stage is slot 0.
Figure 3.7: Multiple-stage trellis graph. (a) Multiple stages (0, 1, ..., n-1, n) of a
trellis graph of the example network. (b) Branch metric and path metric at node 3, with
incoming branches B_{1,3,n}, B_{2,3,n} and B_{3,3,n}: P_{3,n} = min_i [P_{i,n-1} + B_{i,3,n}].
3. State transitions: The forward move from one state at a stage to another allowable
state at the next stage, which takes one unit of time (a time slot), is called a state
transition. In the trellis graph it is represented by a directed edge (or branch)
connecting the two states. The state transitions correspond to the links between the
respective routers, i.e. to a forward step in the network.
4. Branch metric: The branch metric (B) is a measure of the transition that reflects
the value (importance) of the branch. It is a function of several variables, such as the
number of available slots of the branch (a), the number of requested slots (r), and a
weight (w) that reflects the priority of the branch. The numbers of available and
requested slots can be used to balance the network load: the fewer slots the branch can
provide and the more slots the connection requests, the larger the branch metric, which
indicates a less favourable branch. When the number of requested slots exceeds the number
of slots the branch can provide, the branch metric becomes infinity, which indicates that
the branch cannot satisfy the request and will be discarded. The general form of the
branch metric is:
$$B = \begin{cases} f(r, a, w), & r \le a \\ +\infty, & r > a \end{cases}$$
An example of a branch metric function might be:
$$B = \begin{cases} \frac{1}{a} \, r \cdot w, & r \le a \\ +\infty, & r > a \end{cases}$$
5. Path metric: The path metric is the minimal accumulated branch metric over the
shortest path from the initial state to the current state. For each state, the incoming
branch that produces the minimal path metric is selected as survivor path. The branch
metric for the transition from state i to state j at stage n is denoted $B_{i,j,n}$. Let
$P_{j,n}$ be the path metric of state j at stage n, and $S_j$ the set of states that have
transitions to state j. Then:
$$P_{j,n} = \min_{i \in S_j} \left[ P_{i,n-1} + B_{i,j,n} \right]$$
For the trellis graph example in Fig. 3.7b,
$$P_{3,n} = \min\{ P_{1,n-1} + B_{1,3,n},\; P_{3,n-1} + B_{3,3,n},\; P_{2,n-1} + B_{2,3,n} \}$$
If the branch from state 2 produces the minimal path metric, the path metric of state 3
at stage n is
$$P_{3,n} = P_{2,n-1} + B_{2,3,n}$$
The goal of the shortest path problem is to find the path between the source and
destination nodes with the minimal path metric. At the last stage, the survivor path with
the minimal path metric is the desired path.
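The general model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `avail` callback and the 2x2 example data are hypothetical, self-branches are left out for brevity, and the example branch metric r·w/a from the text is used.

```python
import math

def branch_metric(r, a, w=1.0):
    """Example branch metric from the text: (1/a) * r * w if r <= a, else +inf."""
    if a <= 0 or r > a:
        return math.inf
    return r * w / a

def path_metrics(num_nodes, edges, avail, src, r, stages, w=1.0):
    """Viterbi-style recursion P[n][j] = min over i of (P[n-1][i] + B(i, j, n)).

    edges: directed (i, j) pairs; avail(i, j, n) returns the number of free slots
    of branch i -> j when entering stage n (a hypothetical callback).
    Returns (P, pred), where pred[n][j] is the survivor predecessor of j at stage n.
    """
    P = [[math.inf] * num_nodes for _ in range(stages + 1)]
    pred = [[None] * num_nodes for _ in range(stages + 1)]
    P[0][src] = 0.0
    for n in range(1, stages + 1):
        for (i, j) in edges:
            cand = P[n - 1][i] + branch_metric(r, avail(i, j, n), w)
            if cand < P[n][j]:          # keep the branch with the minimal path metric
                P[n][j] = cand
                pred[n][j] = i
    return P, pred

# 2x2 mesh: every link offers 4 free slots, except N1 -> N3, which offers only 1.
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2)]
free_slots = {e: 4 for e in edges}
free_slots[(1, 3)] = 1
P, pred = path_metrics(4, edges, lambda i, j, n: free_slots[(i, j)],
                       src=0, r=2, stages=2)
```

With a request of r = 2 slots, the branch N1 → N3 gets an infinite metric, so the survivor path into node 3 at stage 2 comes from node 2.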
3.3.1.2 Path Search Model Simplification
The previous section presented the general formal model of the trellis graph. However, the
branch metric can be simplified. In this work, a simplified model is employed in which the
branch metric can take only two values, 1 or infinity:
$$B = \begin{cases} 1, & r \le a \\ +\infty, & r > a \end{cases}$$
Therefore, the accumulation of branch metrics in the path metric can be omitted. If the
branch metric is infinity, the branch is discarded; if it is 1, the branch can be
selected. The simplified path metric becomes:
$$P_{j,n} = \min_{i \in S_j} \left[ B_{i,j,n} \right]$$
In our system, every initial slot at the initial stage has its own representation of the
trellis graph, i.e. if the slot table size is S, there are S representations (layers) of
the trellis graph in parallel, as in Fig. 3.8. Every slot at the initial stage thus has
its own trellis graph and can search its own path in parallel, as in Fig. 3.9.
Consequently, during a search, at each stage we only need to know the branch state at one
specific slot. Since a branch at a given slot has only two possible states, free or
unavailable (i.e. already allocated), the branch metric of each slot t can be simplified
to:²
$$B_t = \begin{cases} 1, & \text{branch is free} \\ +\infty, & \text{branch is unavailable} \end{cases}$$
If a flow requires a large bandwidth, i.e. multiple slots, the allocation of these slots
can be done simultaneously, because in the proposed approach each slot searches its own
path in parallel. This can set up multiple paths and thus split the bandwidth over them,
as shown in Fig. 3.10. Compared to approaches that allocate the whole bandwidth along a
single path [SNG12], our multi-path allocation can increase the success rate significantly
[SG11b]. It is worth mentioning that the state transition structure reflects the specific
NoC topology. However, the dimension of the trellis graph depends only on the number of
NoC routers and is thus topology invariant.
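The layered search can be illustrated with a small Python sketch that runs one reachability search per initial slot. It rests on modelling assumptions that are not taken from the thesis: the hop leaving stage n of layer t is checked against slot (t + n) mod S, an active node stays active through its self-branch, and the name `reachable_initial_slots` and the example link states are hypothetical.

```python
def reachable_initial_slots(edges, free, src, dst, S, stages):
    """Return the initial slots t whose trellis layer reaches dst within `stages` hops.

    free[(i, j)][s] is True if link i -> j is free in slot s.
    """
    winners = []
    for t in range(S):                      # one trellis layer per initial slot
        active = {src}
        for n in range(stages):
            slot = (t + n) % S              # slot seen by the hop leaving stage n
            newly = {j for (i, j) in edges if i in active and free[(i, j)][slot]}
            active |= newly                 # already active nodes stay active
            if dst in active:
                winners.append(t)
                break
    return winners

S = 4
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2)]
free = {e: [True] * S for e in edges}
free[(1, 3)] = [False] * S                  # N1 -> N3 is occupied in every slot
free[(0, 2)][0] = False                     # N0 -> N2 is occupied in slot 0

winners = reachable_initial_slots(edges, free, src=0, dst=3, S=S, stages=2)
requested_slots = 2
request_ok = len(winners) >= requested_slots
```

In hardware the S layers are evaluated in parallel; the loop over t only mimics that behaviour sequentially. A request for several slots succeeds in this sketch when at least that many layers reach the destination, possibly over different paths.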
3.3.2 Forward-Backtrack Trellis Path Search
In this section, three different trellis structures are presented and investigated: the
unfolded trellis graph, the folded trellis graph and the bidirectional trellis graph. The
trellis graph is constructed with multiple layers, but all layers are identical, so in
this section we take one layer as an example to show the trellis path search.
² In the hardware implementation, the value ’1’ of the branch state register indicates
that the branch is free, and the value ’0’ indicates that the branch is unavailable.
Figure 3.8: Each slot at the initial stage has its own layer of the trellis. (a) One layer
of the trellis graph; stages n, n+1 and n+2 carry the time slot indices (n+t) mod S,
(n+1+t) mod S and (n+2+t) mod S, assuming the initial slot at the initial stage is t.
(b) Multiple layers of the trellis graph (S layers, for initial slots 0 to S-1); each
initial slot at the initial stage has its own layer.
Figure 3.9: Each slot searches its own path in parallel.
Figure 3.10: Communication bandwidth split over multiple paths. Src: source node, Des:
destination node.
The flow chart of the path search procedure is illustrated in Fig. 3.11. In general, the
shortest path search in the Forward-Backtrack trellis graph comprises two steps:
1. Forward search, i.e. traverse the NoC from the source node to find the best free path
to the destination.
(a) Search through the trellis and calculate the branch metric to determine whether
the current node can be activated, i.e. whether both the incoming branch and its
associated link state register are available;
(b) If the current node can be activated, save one of the active incoming branches
as survivor path;
(c) Continue the search towards the destination node until the last stage is reached.
2. Backtracking, i.e. follow the saved survivor path back from the destination to recover
the shortest path and collect the associated path and slot allocation parameters.
(a) Read the stored predecessor of the current node;
(b) Output this predecessor;
(c) Use the predecessor to read out its own previous state in the trellis;
(d) Repeat until the source node is found.
Figure 3.11: The flow chart of the path search in the trellis. (Start at Src; the search
signals propagate through the trellis and try to activate neighbors. If the destination is
not activated, the search continues until the maximum number of stages is reached, in
which case it fails. Once the destination is activated, backtracking starts at the
destination and collects the survivor path until Src is found.)
3.3.2.1 Unfolded Trellis Search
Since the NoC topology and size are known at design time, the respective trellis graph can
be constructed in advance; e.g. for the 2x2 NoC illustrated in Fig. 3.12a, the associated
trellis graph is shown in Fig. 3.12b. Assume node 0 is the source and node 3 is the
destination. The search signals start from the source at the first stage and traverse the
trellis, trying to activate the connected neighbors at the next stage. The source
activates its connected neighbors, node 1 and node 2, at the second stage, and node 1 and
node 2 try to activate their connected neighbors at the third stage. Assume the branch
N1 → N3 at the second stage is already occupied, so node 1 cannot activate node 3. During
the forward search, if a node is activated by several nodes at the same stage, one and
only one of them (any one) is remembered as its predecessor. If the node was already
activated in a previous stage, it stores itself as the predecessor. This ensures that the
selected path is the shortest one between the source and target nodes. For example, if the
destination, node 3, were activated by both node 2 and node 1, only one of them would be
remembered as its predecessor. Once the destination is active, backtracking starts from
the destination and follows the predecessors in order to collect the path information.
Node 3 backtracks to its predecessor, node 2, at the second stage, and node 2 backtracks
to node 0 at the first stage. The path from source to destination is thus obtained as
N0 → N2 → N3. If the beginning slot at the source is t, the slot sequence along the path
is {t, (t+1) mod S, (t+2) mod S}, where S is the slot table size.
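The 2x2 example above can be replayed with a short Python sketch of the forward search and the backtracking on one trellis layer. It is a software model under assumptions, not the hardware design: the name `forward_backtrack`, the data layout and the convention that the hop leaving stage n uses slot (t + n) mod S are chosen for this illustration.

```python
def forward_backtrack(edges, free, src, dst, S, stages, t=0):
    """Forward search followed by backtracking on one trellis layer.

    Returns a list of hops (from_node, to_node, slot), or None if dst is not
    activated within `stages` stages. free[(i, j)][s] is True if the link is free.
    """
    pred = [{src: src}]                     # pred[n][j]: predecessor of j at stage n
    for n in range(stages):
        slot = (t + n) % S
        prev = pred[-1]
        cur = {j: j for j in prev}          # already active nodes store themselves
        for (i, j) in edges:
            if i in prev and j not in cur and free[(i, j)][slot]:
                cur[j] = i                  # remember only the first activating branch
        pred.append(cur)
        if dst in cur:
            break
    if dst not in pred[-1]:
        return None
    hops, node = [], dst
    for n in range(len(pred) - 1, 0, -1):   # read the stored predecessors backwards
        p = pred[n][node]
        if p != node:                       # a self-predecessor means "wait", no hop
            hops.append((p, node, (t + n - 1) % S))
        node = p
    hops.reverse()
    return hops

S = 4
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2)]
free = {e: [True] * S for e in edges}
free[(1, 3)] = [False] * S                  # branch N1 -> N3 is occupied
hops = forward_backtrack(edges, free, src=0, dst=3, S=S, stages=2)
path = [hops[0][0]] + [h[1] for h in hops]
```

For this input the sketch returns the hops N0 → N2 in slot t and N2 → N3 in slot (t+1) mod S, which matches the path N0 → N2 → N3 derived in the text.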
3.3.2.2 Folded Trellis Search
The proposed unfolded trellis path search algorithm has a regular structure and can
therefore be mapped efficiently onto a folded architecture, in which the hardware
resources are reused by all partitions of the folded algorithm. The folded architecture
requires additional output registers to hold the intermediate results that serve as input
values in the next iteration. The folded path search for the example of Fig. 3.12b is
illustrated in Fig. 3.12c, in which only one decision stage has to be implemented (not
counting the first stage). Each node has a register that stores which predecessor
activated it, and it stores only the predecessor that activated it first. When a node is
active, in the next cycle it forwards the search signal to its corresponding first-stage
node, and the search propagation is repeated. Note that the search now takes multiple
cycles, i.e. one cycle per iteration. The search stops in two cases: i) the target node
has been activated with sufficient bandwidth, or ii) a certain number of iterations (by
default 2N − 2) has been reached (alternatively, the search can be stopped when no new
nodes are activated any more). Hence, livelock is avoided.
Figure 3.12: a) 2x2 2D-mesh example NoC; b) schematic structure of the unfolded trellis
search for the example NoC (with an unavailable link on which the search fails);
c) schematic structure of the folded trellis search; d) schematic structure of the
bidirectional trellis search. Src: source node, Des: destination node.
In routing algorithms that offer several alternative paths to the destination, packets may
arrive out of order. Traditional algorithms of this kind, such as [SG11b], require a
complicated in-order path selection mechanism to ensure that packets arrive at the
destination in order. This problem can be solved by choosing only paths that ensure
in-order delivery. Assume two paths begin at the source node at slots t1 and t2 (t1 < t2),
with path lengths l1 and l2 respectively. The paths are selected only if t1 + l1 < t2 + l2,
which ensures in-order delivery. Here, however, a simpler solution is possible: we select
only the paths that reach the destination in the last search cycle, which ensures that the
selected paths have the same length, so in-order delivery is guaranteed and the
complicated in-order path selection mechanism can be omitted.
In the example, the search signals start from the source and activate node 2. In the next
cycle, the signal of node 2 travels back to node 2 at the first stage and goes on to
activate node 3. The backtrack starts from the destination, node 3, and reads out the
nodes in the sequence N3 → N2 → N0. Therefore, the path from source to destination is
obtained as N0 → N2 → N3.
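The in-order condition t1 + l1 < t2 + l2 can be checked with a few lines of Python. This is a minimal sketch under simplifying assumptions: the function name `in_order` is hypothetical, paths are given as (start slot, length) pairs, and slot wrap-around within the slot table is ignored.

```python
def in_order(paths):
    """Check t1 + l1 < t2 + l2 for every pair of consecutive paths.

    paths: list of (start_slot, length) pairs. Paths are sorted by start slot
    first; wrap-around of slot indices is ignored in this sketch.
    """
    ordered = sorted(paths)
    arrivals = [t + l for (t, l) in ordered]
    return all(a < b for a, b in zip(arrivals, arrivals[1:]))

# Paths of equal length with distinct start slots always satisfy the condition.
equal_length = [(0, 2), (1, 2), (2, 2)]
# A long early path followed by a short later path violates it: 0 + 4 > 1 + 2.
mixed_length = [(0, 4), (1, 2)]

equal_ok = in_order(equal_length)
mixed_ok = in_order(mixed_length)
```

This is why selecting only paths of the same length, as the folded search does, makes an explicit in-order check unnecessary.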
3.3.2.3 Bidirectional Trellis Search
The path search presented in the previous sections starts at one side, i.e. from the
source node towards the target node. The search can also be started from both sides, i.e.
from the source and target nodes simultaneously, and checked at the middle stage: if the
search signals from the two sides meet there, the search is successful. The bidirectional
path search for the example of Fig. 3.12b is illustrated in Fig. 3.12d. The search from
the source activates node 2, and the search from the destination also activates node 2. At
the middle stage, the search signals from source and destination meet at node 2, which
means the search succeeds. The backtrack starts from node 2 towards the source and the
destination simultaneously. The path from source to destination is obtained as
N0 → N2 → N3. In the bidirectional search, the critical path is halved while the area
stays almost the same.
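A software model of the meet-in-the-middle idea might look as follows. It is a sketch under assumptions: the name `bidirectional_search` is hypothetical, the hop from stage n−1 to stage n is assumed to use slot (t + n − 1) mod S in both search directions, and active nodes keep themselves active through a self-branch.

```python
def bidirectional_search(edges, free, src, dst, S, stages, t=0):
    """Search from src and dst at once and test whether both fronts meet midway.

    Returns the node sequence of one path, or None if the fronts do not meet.
    """
    mid = stages // 2
    fwd = [{src: src}]                       # fwd[n][j]: predecessor of j at stage n
    for n in range(mid):
        slot = (t + n) % S
        cur = {j: j for j in fwd[-1]}
        for (i, j) in edges:
            if i in fwd[-1] and j not in cur and free[(i, j)][slot]:
                cur[j] = i
        fwd.append(cur)
    bwd = [{dst: dst}]                       # bwd[k][i]: successor of i, k steps from dst
    for n in range(stages, mid, -1):
        slot = (t + n - 1) % S               # slot of the hop from stage n-1 to stage n
        cur = {i: i for i in bwd[-1]}
        for (i, j) in edges:
            if j in bwd[-1] and i not in cur and free[(i, j)][slot]:
                cur[i] = j
        bwd.append(cur)
    meet = [v for v in fwd[-1] if v in bwd[-1]]   # AND of the two search fronts
    if not meet:
        return None
    left, node = [meet[0]], meet[0]
    for n in range(mid, 0, -1):              # backtrack towards the source
        p = fwd[n][node]
        if p != node:
            left.append(p)
        node = p
    right, node = [], meet[0]
    for k in range(len(bwd) - 1, 0, -1):     # backtrack towards the destination
        s = bwd[k][node]
        if s != node:
            right.append(s)
        node = s
    return left[::-1] + right

S = 4
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 3), (3, 1), (2, 3), (3, 2)]
free = {e: [True] * S for e in edges}
free[(1, 3)] = [False] * S                   # branch N1 -> N3 is occupied
path = bidirectional_search(edges, free, src=0, dst=3, S=S, stages=2)
```

Each direction only traverses about half of the stages, which reflects the halved critical path mentioned in the text.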
3.3.3 Forward-Backtrack Trellis Path Search Implementation
This section presents the implementation details of the unfolded, bidirectional and
folded trellis.
The Detect-Select-Shift (DSS) unit is the core module; it implements the function of a
state in the trellis graph. It evaluates the search signals propagated from the previous
stage and generates bit-vector flags representing the slot availability on specific links.
The actual link allocation state is stored in the ‘Link State’ register. When a link is
allocated at a specific slot, its corresponding state register is set to ‘0’, thereby
excluding it from future searches. Correspondingly, the state register is set to ‘1’ when
the link is released.
The details of an example DSS unit, for node 3 of the example NoC with a slot table size
of 2, are shown in Fig. 3.13. Since the implementation is the same for each slot, we take
one slot as an example. The working flow for each slot in the DSS unit is as follows:
1. Detect the available slot: The search signal from N1 → N3 and its corresponding
‘Link State’ register are connected to an AND gate. If the search signal and the
corresponding link are both valid, the current node can be activated by this search.
2. Select an active incoming branch as survivor path: The detect signals from the
neighbor (predecessor) nodes (i.e. the outputs of the AND gates) that belong to the same
slot are connected to an OR gate. If the node can be activated (i.e. the output of the
OR gate is ‘1’), one of the active incoming branches is saved in a register as survivor
path.
3. Cyclically shift the slots: The slots are shifted cyclically to synchronize with the
next hop. After shifting, the search signal at slot t moves to slot (t + 1) mod S, where
S is the slot table size. The cyclic shift is realized purely by wiring.
Figure 3.13: Implementation details of an example DSS unit of node 3. (For each of slot 0
and slot 1, the incoming search signals from N1 and N2 are ANDed with the N1 → N3 and
N2 → N3 link state registers of that slot, and the results are ORed; a cyclic shift then
maps the result onto the output signal to the next stage.)
As Fig. 3.13 shows, the critical path of the DSS unit consists of only one AND gate and
one OR gate, which is very short. The implementation structure of the DSS unit is the same
for all three trellis structures: unfolded, bidirectional and folded.
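The three DSS steps can be mimicked in Python with integers used as S-bit vectors (bit s stands for slot s). This is a behavioural sketch only, not the Verilog implementation; the names `rotl` and `dss` and the example register contents are assumptions made for illustration.

```python
def rotl(v, S):
    """Cyclic shift by one position within S bits: slot t moves to slot (t+1) mod S."""
    mask = (1 << S) - 1
    return ((v << 1) | (v >> (S - 1))) & mask

def dss(search_in, link_state, S):
    """Detect-Select-Shift for one node.

    search_in[p]:  S-bit search vector arriving from predecessor p.
    link_state[p]: S-bit state of link p -> node, bit = 1 means free (as in the text).
    Returns (output vector for the next stage, survivor predecessor per slot).
    """
    # Detect: AND of each incoming search signal with the matching link state.
    detect = {p: search_in[p] & link_state[p] for p in search_in}
    # Select: OR over all predecessors; remember one active branch per slot.
    active = 0
    for v in detect.values():
        active |= v
    survivor = {}
    for s in range(S):
        for p, v in detect.items():
            if (v >> s) & 1:
                survivor[s] = p
                break
    # Shift: align the result with the slot of the next hop.
    return rotl(active, S), survivor

# Node 3 with S = 2: N1 -> N3 is allocated in both slots, N2 -> N3 is free in slot 0.
S = 2
search_in = {1: 0b11, 2: 0b11}
link_state = {1: 0b00, 2: 0b01}
out, survivor = dss(search_in, link_state, S)
```

Here only slot 0 of the branch from N2 survives the detect step, and the shift moves the corresponding flag to slot 1 for the next stage.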
3.3.3.1 Unfolded Trellis Implementation
The implementation schematic of the unfolded trellis is shown in Fig. 3.14. The path
search is completed in two cycles. In the first cycle, the search begins at the source
node by setting all of its slots to logic ‘1’ (i.e. valid); the search signals then
propagate forward along the edges to the connected neighbors via the DSS units, which
check the slots’ availability. The signals that are still valid continue to propagate in
this way until the end of the trellis is reached, where a register stores which nodes
could be reached through the NoC. The bandwidth (i.e. the number of available slots) is
checked only at the last stage.
Figure 3.14: Implementation schematic of the unfolded trellis for the example NoC.
(Forward search through the DSS units from Src to Des, followed by the path backtrack into
the ‘chosen nodes’ path register; each edge is as wide as the number of slots; the stages
carry the time slot indices n, (n+1) mod S and (n+2) mod S, where S is the slot table
size.)
Figure 3.15: Implementation schematic of the bidirectional trellis for the example NoC.
(The search signals from Src and Des are ANDed at the middle stage to check whether the
searches meet.)
In the next cycle, if the intended target node is active, the backtrack is started, with
each selected slot backtracking its own path. The path is selected by reading the stored
predecessors, starting at the intended target. Each selected node at each stage records
its selection in the ‘chosen node’ register, which eventually holds the complete path from
the source to the target node. The ‘Link State’ registers corresponding to the selected
slots are set to ‘0’.
Figure 3.16: Implementation schematic of the folded trellis for the example NoC. (Forward:
the DSS units of nodes N0 to N3 feed registers whose outputs are fed back in the next
cycle; each edge is S bits wide, where S is the slot table size. Backtracking: starting at
Des, the BK_MUX multiplexers select the stored predecessor[S-1:0] entries, with
Sel = 2’b00 to 2’b11, through a reversed cyclic shift.)
3.3.3.2 Bidirectional Trellis Implementation
The implementation schematic of the bidirectional trellis is shown in Fig. 3.15. The
search is started at the source (at the initial stage) and at the destination (at the last
stage) simultaneously and propagates until it reaches the middle stage. At the middle
stage, the corresponding search signals from the two sides are connected to an AND gate to
check whether the searches meet. In the next cycle, if the two searches meet, the
backtrack starts from the middle stage by reading out the stored predecessors hop by hop.
Each selected predecessor at each stage is saved into the ‘chosen node’ register, which
eventually holds the complete path from the source to the target node.
3.3.3.3 Folded Trellis Implementation
The search begins at the source node by setting all of its slots to logic ‘1’; the search
signals then propagate forward along the edges to the connected neighbors via the DSS
units. In the next cycle, the signals that are still valid travel back to the first stage
and try to activate their connected neighbors. The search continues to propagate in this
way until it reaches the target node or exceeds the limit on the number of search cycles.
If the destination node is activated with sufficient slots, the backtracking is started,
with each selected slot backtracking its own path simultaneously. The path is selected by
reading the stored predecessors, starting at the destination. The node read out in the
previous cycle requests its predecessor via the multiplexer BK_MUX (in Fig. 3.16). The
predecessors are read out in this manner until the source node is obtained.
It should be noted that, to increase the success rate, a data flit can wait in a node for
several cycles until a path becomes available, which differs from the consecutive
allocation in [LJL14b, SNG12]. For example, if a node is reached by the search signal at
slot 0, and the connected downstream neighbor is not available at slot 1 but is available
at slot 2, the search signal can wait in the current node for one time slot and then reach
the connected neighbor at slot 2.
3.4 Performance Evaluation of Forward-Backtrack trellis
The synthesis and simulation results of our NoCManagers are presented in this section.
3.4.1 Synthesis Results
The NoCManager is designed in synthesizable Verilog HDL and can be generated from an XML
description for different NoC sizes. Using Synopsys Design Compiler, the NoCM was
synthesized in TSMC 65 nm technology for mesh networks of sizes from 4x4 to 10x10. For
folded TESSA, the critical path was constrained to 1 ns. For unfolded TESSA and unfolded
bidirectional TESSA, the critical path constraint was relaxed gradually with increasing
NoC size. For example, in unfolded bidirectional TESSA, the critical path is constrained
to 1.11 ns for a 6x6 mesh and to 2 ns for a 10x10 mesh.
Since all stages in TESSA are identical, it is possible to implement only one decision
stage as a folded architecture, which reduces the hardware resources remarkably but
requires more cycles for the path search. The folded and unfolded structures can also be
combined by folding several stages instead of one, which provides a suitable tradeoff
between area and performance. We therefore implemented a TESSA variant that folds half of
the stages, called half-folded TESSA: in an N × N mesh, N − 1 stages are implemented, and
the forward search finishes in at most two cycles.
The synthesis results of the four structures are presented and compared in this section.
The structures are compared in terms of area, average Area · Time (AT) complexity per
allocation and average energy consumption per allocation. The allocation time per
allocation differs between the structures. In an N × N mesh, the average allocation time
per allocation of the different structures³ is as follows:
• In unfolded TESSA, two cycles;
• In bidirectional TESSA, two cycles;
• In folded TESSA, 2 · (N − 1) cycles, because the average path length is N − 1;
• In half-folded TESSA, three cycles: the search can finish in the first or in the second
iteration, i.e. the allocation time is either two or four cycles, so three cycles on
average.
³ The clock frequency differs between the structures.
Figure 3.17: Area of the different TESSA structures for different NoC sizes and slot table
sizes. (x-axis: #Routers, 10 to 100; y-axis: area in 10^4 µm²; curves: unfolded,
half-folded, folded and bidirectional TESSA, each with #slots = 4, 8 and 16.)
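The cycle counts listed above translate into allocation times once a clock period is fixed. The Python sketch below encodes those counts; the function names are hypothetical, and taking the clock period equal to the critical path constraint quoted in the text (1 ns for folded TESSA, 1.11 ns for bidirectional TESSA in a 6x6 mesh) is an assumption of this example.

```python
def allocation_cycles(structure, n):
    """Average allocation cycles per allocation in an n x n mesh, as listed in the text."""
    cycles = {
        "unfolded": 2,
        "bidirectional": 2,
        "folded": 2 * (n - 1),       # average path length is n - 1
        "half-folded": 3,            # two or four cycles, three on average
    }
    return cycles[structure]

def allocation_time_ns(structure, n, clock_period_ns):
    """Average allocation time, assuming the given clock period."""
    return allocation_cycles(structure, n) * clock_period_ns

def at_product(area, structure, n, clock_period_ns):
    """Area-time product per allocation; `area` is whatever area unit the caller uses."""
    return area * allocation_time_ns(structure, n, clock_period_ns)

# 6x6 mesh, clock periods assumed equal to the critical path constraints in the text.
folded_ns = allocation_time_ns("folded", 6, 1.0)
bidirectional_ns = allocation_time_ns("bidirectional", 6, 1.11)
```

Under these assumptions the folded structure needs 10 ns per allocation in a 6x6 mesh and the bidirectional structure about 2.22 ns, which illustrates why the structures with a slower clock can still allocate faster.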
From the area synthesis results shown in Fig. 3.17, we can see that the area of unfolded
TESSA grows with O(S · M · √M) in a 2D mesh (M = number of routers, S = slot table size),
where √M is related to the number of trellis stages (2 · (√M − 1)). The area of folded
TESSA grows with O(S · M) in a 2D mesh. As folded TESSA reuses hardware, its area cost is
the lowest. From the AT complexity results shown in Fig. 3.18, we can see that
bidirectional TESSA gives the best area-time product, which is due to the halved search
time, while the AT complexity of unfolded TESSA is the worst. The energy consumption per
allocation is shown in Fig. 3.19. Again, bidirectional TESSA has the best energy
efficiency. Since folded TESSA uses additional registers to store the intermediate
results, its energy efficiency is the worst. Microcontroller-based software solutions
[SNG12, MMB07, MBD+05] consume roughly up to tens of nanojoules (hundreds of cycles) per
allocation in a 4x4 mesh network; compared to them, our hardware solutions, which cost
only tens of picojoules per allocation, would be about 1000× more energy efficient.
The comparison of the different TESSA structures is shown in Table 3.1. In conclusion, the
folded structure is the most area efficient, while the bidirectional structure is the best
in terms of AT complexity and energy efficiency. Another advantage of the bidirectional
structure is its high allocation speed. Hence, in dynamic systems with a high connection
request rate, the bidirectional structure is preferable. In other scenarios, the
appropriate structure can be selected depending on the specific system requirements.
Figure 3.18: Average AT complexity per allocation of the different TESSA structures for
different NoC sizes and slot table sizes. (x-axis: #Routers, 10 to 100; y-axis: AT per
allocation in 10^4 µm²/GHz; curves: unfolded, half-folded, folded and bidirectional TESSA,
each with #slots = 4, 8 and 16.)
Figure 3.19: Average energy consumption per allocation of the different TESSA structures
for different NoC sizes and slot table sizes. (x-axis: #Routers, 10 to 100; y-axis: energy
per allocation in picojoules; curves: unfolded, folded and bidirectional TESSA, each with
#slots = 4, 8 and 16.)
Table 3.1: The comparison of different TESSA structures.
                        Unfolded       Bidirectional unfolded   Folded     Half-folded
latency per allocation  low            lowest                   high       average
complexity              O(S· M· √M)    O(S· M· √M)              O(S· M)    O(S· M· √M)
area cost               high           high                     low        average
AT product              high           low                      average    average
energy efficiency       good           good                     poor       good
3.4.2 Simulation Results
In TESSA, if a flow requires multiple slots, these slots can be allocated along multiple
paths, which splits the bandwidth over multiple paths and increases the success rate. In
order to evaluate the influence of this multi-path allocation on the success rate, we also
realized a single-path solution that employs the trellis graph algorithm for path search
but allocates all required slots (bandwidth) on a single path. This solution is referred
to as single path trellis. Its trellis structure is the same as that of TESSA, except that
the required bandwidth is allocated on a single path instead of multiple paths. The
allocation speed and success rate of the TESSA NoCManagers are compared to previous
centralized and distributed allocation techniques for different NoC sizes and slot table
sizes under uniform random traffic. The source node sends the connection request to the
NoCM over dedicated wires, and the allocation information from the NoCM to the source is
delivered via the NoC as a GS packet. We also evaluate the influence of splitting the link
into different numbers of time slots and of allowing detours of different numbers of hops.
These results are explained in the following sections.
For the evaluation, several performance metrics are used:
• success rate denotes the ratio of successful requests (requests for which paths with
sufficient bandwidth were established) to the total number of requests.
• background traffic refers to the percentage of slots that are randomly marked as
occupied in advance, which excludes them from the path search, as in [SNG12]. In our
experiments, an equal number of slots is occupied at each router, but which slots are
occupied is selected randomly.
• allocation time denotes the number of clock cycles that the algorithm needs to find
a solution or to determine that the allocation is not possible.
• total allocation time denotes the number of clock cycles that the algorithm needs to
find the solution plus the time needed to send the allocation information to the
source node.
• GS offered load refers to the required data transfer per connection multiplied by the
connection request rate per master. Suppose the connection request rate per master
is 1/2000 per cycle and each connection delivers 200 flits of data after setup;
then the offered load is 200/2000 = 0.1 flits/cycle. It measures the traffic each
master offers relative to its maximum bandwidth.

Figure 3.20: Allocation speed compared to the Microblaze software-based approach
[SNG12] with different background traffic in a 4x4 NoC with slot table size of 16.
(x-axis: #hops; y-axis: allocation time in ns; curves: exhaustive path search on
Microblaze with 0%, 10% and 20% bk, and folded TESSA.)
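As a quick check of the GS offered load definition, the following minimal sketch reproduces the worked numbers from the text. The function name `gs_offered_load` and its parameter names are illustrative assumptions and do not appear in the thesis.

```python
from fractions import Fraction

def gs_offered_load(flits_per_connection, cycles_between_requests):
    """Offered load of one master in flits/cycle (hypothetical helper).

    The connection request rate is 1 / cycles_between_requests per cycle, so
    the load is the data per connection multiplied by that rate.
    """
    return Fraction(flits_per_connection, cycles_between_requests)

# Example from the text: 200 flits per connection, one request every 2000 cycles.
load = gs_offered_load(200, 2000)
print(float(load))  # 0.1
```

Using exact fractions avoids floating-point rounding when offered loads are later compared against saturation thresholds.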
3.4.2.1 Comparison with centralized exhaustive path-search
We compare folded TESSA against the single-path exhaustive path-search that runs on a
Microblaze processor (@288 MHz) [SNG12] in this section.
Comparison of allocation speed. We compare the allocation speed against the exhaustive
path-search with 0%, 10% and 20% random background traffic (bk) in a 4x4 mesh network.
At each node, the exhaustive path-search algorithm has to try different directions one
by one until it finds a free path to the next hop. Under heavy background traffic, it
has to try more directions to find a free path. Hence, the allocation time of the
exhaustive path-search varies with the background traffic. In contrast, TESSA searches
all directions simultaneously, so all possible paths are explored concurrently, and its
allocation time depends on the number of hops but is independent of the background
traffic.
From Fig. 3.20, we can see that the allocation time of the exhaustive path-search
increases linearly with the path length without background traffic (0% bk) and
increases exponentially with background traffic (10% and 20% bk), while the allocation
time of TESSA always increases linearly with the path length. TESSA is about hundreds
to a thousand times faster than the exhaustive path-search. For 6 hops, TESSA needs 12
cycles (12 ns @ 1 GHz), i.e. 6 cycles for the forward search and 6 cycles for
backtracking, while the exhaustive path-search⁴ needs 8848 ns with 10% background
traffic; TESSA is thus 737 times faster. The figure also shows that the exhaustive
path-search with 20% background traffic has a shorter allocation time than with 10%
background traffic. The reason is that higher background traffic leads to a lower
success rate, so the algorithm determines earlier that no route is possible.

Figure 3.21: Success rate compared to single path solutions in a 4x4 NoC with different
background traffic and slot table size of 16. (x-axis: requested BW in slots; y-axis:
success rate; curves: single path trellis and multipath folded TESSA at 10%, 20%, 30%
and 50% bk, and single path exhaustive search at 10% and 20% bk.)
Comparison of Success Rate. The requests sent to the NoCM are generated as follows: an
allocation is requested for every feasible source-destination pair under a given
percentage of background traffic. We produce 1000 samples at each background traffic
percentage. We run the simulation for 4x4 meshes with 1 to 16 requested slots under
background traffic from 10% to 50%.
The multipath folded TESSA, the single path trellis, and the single-path software-based
exhaustive path-search [SNG12] are compared in Fig. 3.21 and Fig. 3.22. The search
algorithms of the single path trellis and the single-path software approach are similar
in that the required bandwidth is allocated over a single path, so in the scenarios for
which results of the software method are not provided, we can expect them to be similar
to those of the single path trellis. The success rate of the single path trellis is
higher than that of the software method because it can detour when there is no minimal
path, whereas [SNG12] only searches minimal paths.
⁴ The hop count in [SNG12] is adapted to the distance from router to router.

Figure 3.22: Success rate compared to the single path approach in an 8x8 NoC with
different background traffic and slot table size of 16. (x-axis: requested BW in slots;
y-axis: success rate; curves: single path trellis and multipath TESSA at 20%, 30%, 40%
and 50% bk.)
From Fig. 3.21 and Fig. 3.22, we can see that the success rate decreases when the
requested bandwidth or the background traffic increases, as expected. The success rate
of folded TESSA can be several to a hundred times higher than that of the single path
trellis, and the gain over the single-path exhaustive path-search is even larger. Under
heavy background traffic with high requested bandwidth, TESSA is far superior to the
two single-path solutions. For example, in a 4x4 mesh with 16 requested slots under 20%
background traffic, the success rate of folded TESSA is 32X higher than that of the
single path trellis and 49X higher than that of the exhaustive path-search; the
advantage over the single path trellis increases to 103X under 50% background traffic.
In an 8x8 network with 16 requested slots under 50% background traffic, the success
rate of TESSA is 0.074 while that of the single path trellis is only 0.0002, i.e. about
371 times higher.
3.4.2.2 Comparison with distributed parallel probe search
Bidirectional unfolded TESSA is compared to the state-of-the-art distributed parallel
probe search [LJL14b] in this section. In distributed parallel probe search, the source
node sends a setup flit that traverses the NoC along all minimal paths in an attempt to
reach the target node. It is a flood-based algorithm that eliminates redundant incoming
paths. Each point in the plots is obtained from a simulation of 1 million cycles. The
masters issue connection requests to the NoCM following uniform random traffic that
meets the given GS offered load. The first and last 100,000 simulation cycles are not
considered in order to exclude transient effects.
The connection lifetime, i.e. the number of flits that each connection delivers, is set
to 100 flits, 200 flits and 500 flits. During simulation, half of the nodes are assumed
to be masters and half of the nodes are assumed to be slaves. The master nodes are
distributed uniformly at random in the system. The source-destination pairs are
selected uniformly at random, so that each source-destination pair has equal
probability of being chosen.

Figure 3.23: Allocation speed of bidirectional TESSA compared to probe search in
different networks with different GS offered load and slot table size of 16. (x-axis:
GS offered load; y-axis: average total allocation time in nanoseconds; curves: probe
search and bidirectional TESSA for 6x6, 8x8 and 16x16 meshes.)
Comparison of allocation speed. In distributed parallel probe search, multiple trials
might be needed before the search eventually succeeds, because only a single slot is
investigated at a time. In contrast, bidirectional TESSA searches all slots in parallel
and completes the search in two clock cycles, independent of the number of slots.
Although our design needs additional time to send the allocation information to the GS
source, the path from the NoCM to the source node is found by the NoCM in two cycles as
a GS path. If the allocation of the GS path from the NoCM to the source node fails, the
allocation information is sent to the source as best-effort packets. Because the GS
offered load in this simulation setting is not high (lower than 0.145 in the 16x16 mesh
and lower than 0.415 in the 6x6 mesh), the corresponding allocation success rate is
higher than 0.999, so the influence of an allocation failure of the GS path from the
NoCM to the source is negligible.
Figure 3.24: Success rate of bidirectional TESSA compared to probe search in different
networks. Each connection delivers 200 flits. (x-axis: GS offered load; y-axis: success
rate; curves: probe search and TESSA with 16 slots for 6x6, 8x8 and 16x16 meshes.)

For example, in an 8x8 mesh with slot table size of 16 at GS offered load 0.3, the
average total allocation time for a single slot in probe search is 150 time slots
(0.5 · 150 = 75 ns)⁵. In TESSA, however, only 2 clock cycles each are needed to find
the requested GS path and to find the path from the NoCM to the source. As the NoCM is
connected to the center node of the NoC, on average 4 time slots are needed by the NoCM
to send the allocation information to the source. With a slot table size of 16, the
NoCM waits on average 8 time slots for its turn in the TDM scheme. So in total, on
average 12 time slots (= 4 + 8) plus 4 cycles (5.6 ns @ critical path 1.4 ns) are
needed, which amounts to 11.6 ns (= 12 · 0.5 ns + 5.6 ns) and is 546% faster than
distributed parallel probe search. In general, in an NxN mesh with slot table size S,
the average allocation time is 4 cycles + $\frac{N+S}{2}$ time slots. When n slots are
requested, the allocation time of probe search might increase n-fold, whereas the
allocation time of TESSA stays the same because it is independent of the number of
requested slots. Hence, when more slots are requested, our solution presents even
better results.

⁵ A time slot is the routing time per router; in [LJL14b] a slot is 0.5 ns. We assume
that the time slots in both systems are the same.
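The arithmetic of this example can be reproduced with a short sketch. The function `tessa_total_allocation_ns` and its parameter names are hypothetical; the default values (0.5 ns per time slot, 4 search cycles, 1.4 ns critical path) are the ones quoted in the text, and the formula is the general expression of 4 cycles plus (N + S)/2 time slots.

```python
def tessa_total_allocation_ns(mesh_n, slot_table_size,
                              slot_ns=0.5, search_cycles=4, cycle_ns=1.4):
    """Average total allocation time of bidirectional TESSA in nanoseconds."""
    # Two cycles to find the requested GS path plus two cycles to find the
    # NoCM-to-source path (4 cycles in total by default).
    search_ns = search_cycles * cycle_ns
    # N/2 time slots to reach the source from the central NoCM plus S/2 time
    # slots of average waiting for the NoCM's turn in the TDM scheme.
    delivery_slots = (mesh_n + slot_table_size) / 2
    return search_ns + delivery_slots * slot_ns

tessa_ns = tessa_total_allocation_ns(8, 16)  # 8x8 mesh, 16 slots
probe_ns = 150 * 0.5                         # 150 time slots of 0.5 ns each
speedup = probe_ns / tessa_ns
print(f"TESSA: {tessa_ns:.1f} ns, probe search: {probe_ns:.1f} ns, ratio: {speedup:.2f}")
```

For the 8x8 example the ratio comes out at roughly 6.5, which is consistent with the 546% speed advantage stated above.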
Fig. 3.23 shows the average total allocation time when a single slot is requested, for
different network sizes. When the GS offered load is low, the allocation speed of TESSA
is similar to that of distributed parallel probe search. However, when the offered load
increases, TESSA becomes much faster. Compared to distributed parallel probe search,
our approach provides up to 710% higher speed in the 6x6 mesh (@offered load 0.4), up
to 647% higher speed in the 8x8 mesh (@offered load 0.3), and up to 650% higher speed
in the 16x16 mesh (@offered load 0.14). In distributed parallel probe search, the
allocation time increases dramatically beyond the saturation point of the network
(offered load 0.14 in the 16x16 mesh and 0.41 in the 6x6 mesh). Consequently, our
approach is far superior to distributed parallel probe search beyond that point.
Figure 3.25: Success rate compared to probe search in 6x6 and 8x8 mesh networks with
8 or 16 slot table size. Each connection delivers 100 or 500 flits. (x-axis: GS offered
load; y-axis: success rate.)
Comparison of Success Rate. We re-implemented the distributed parallel probe search
according to Liu's work [LJL14b] for comparison, with a retry deadline of 200 cycles.
In distributed parallel probe search, each node has a buffer to store incoming
requests, and the buffer size is equal to the slot table size. The source node keeps
retrying a request until it succeeds, the deadline is exceeded, or the buffer is full.
In distributed parallel probe search, when several connections are requested
simultaneously, the concurrent searches might block each other. It only searches
minimal paths and cannot make detours as TESSA can. Moreover, its retry-before-deadline
policy can terminate the search as a failure before all slots have been investigated,
even though paths might still be available. In its simulation setting, the deadline is
200 cycles, and 2l + S + 6 cycles are needed to investigate a single slot (l is the
distance between source and destination, S is the slot table size). Assuming l = 10 and
S = 16, a single slot takes 42 cycles, so only four slots can be investigated before
the deadline, while TESSA investigates all 16 slots simultaneously. For these reasons,
our system is expected to achieve a much higher success rate than distributed parallel
probe search.
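The deadline argument can be checked numerically. The sketch below assumes that the probe search investigates slots strictly one after another and that every attempt costs the full 2l + S + 6 cycles; the function name and the cap at S slots are assumptions made for illustration.

```python
def probe_slots_before_deadline(deadline_cycles, distance, slot_table_size):
    """Number of slots a sequential probe search can try before its deadline."""
    # Cycles needed to investigate one slot: 2l + S + 6.
    cycles_per_slot = 2 * distance + slot_table_size + 6
    # There are only S slots, so cap the result at the slot table size.
    return min(slot_table_size, deadline_cycles // cycles_per_slot)

# Example from the text: deadline 200 cycles, l = 10, S = 16.
print(probe_slots_before_deadline(200, 10, 16))  # 4
```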
Fig. 3.24 and Fig. 3.25 show the success rate comparison. The simulation results show
that the success rate of our method is higher than that of probe search. E.g., in the
6x6 NoC at offered load between 0.6 and 1.0 with slot table size of 16, our solution
offers up to 26% higher success rate (@ connection lifetime of 100 flits). Our solution
offers up to 29% and 24% higher success rate in the 8x8 and 16x16 NoC, respectively.
From Fig. 3.25 we can see that with more slots (16 slots versus 8 slots), the success
rate becomes higher, which is due to the increased path diversity.
Figure 3.26: Success rate influence of splitting the link into different slots in 6x6
and 8x8 networks. Each connection delivers 200 or 500 flits. (x-axis: GS offered load;
y-axis: success rate; curves: 4, 8 and 16 slots.)
3.4.2.3 Influence of splitting the link into different time slots on success rate
In this section, we split the link into different numbers of time slots, e.g. 4, 8 and
16 slots, for 6x6 and 8x8 mesh networks to see the influence on the success rate. In
the simulation setting, each connection delivers 200 flits before release in the 6x6
mesh and 500 flits before release in the 8x8 mesh. As the results in Fig. 3.26 show,
splitting the link into more time slots provides a higher success rate. For the 6x6
mesh, a 16-slot split provides up to 9% higher success rate than a 4-slot split and up
to 3% higher than an 8-slot split. For the 8x8 mesh, a 16-slot split provides up to 10%
higher success rate than a 4-slot split and up to 3.5% higher than an 8-slot split. The
reason is that more time slots provide higher path diversity, which contributes to the
higher success rate. For example, if the link is split into S time slots, it provides
up to S different options to route over this link and can support up to S flows
simultaneously.
3.4.2.4 Influence of allowing different hops of detours on success rate
In this section, we evaluate the influence of allowing different numbers of detour hops
on the success rate for 6x6 and 8x8 mesh networks. In the simulation setting, the
unfolded trellis is constructed with 10, 12, 14 and 16 stages for the 6x6 mesh to allow
0, 2, 4 and 6 more detour hops than the default trellis (by default, the unfolded
trellis is constructed with 2N − 2 stages for an NxN mesh network, i.e. 10 stages for
the 6x6 mesh), and with 14, 17 and 20 stages for the 8x8 mesh to allow 0, 3 and 6 more
detour hops than the default trellis.
As the simulation results in Fig. 3.27 and Fig. 3.28 show, allowing more detour hops
provides a higher success rate. For the 6x6 mesh with 8 time slots, the 16-stage
trellis (6 more detour hops allowed) provides up to 4% higher success rate than the
12-stage trellis (2 more detour hops allowed) and up to 10% higher success rate than
the 10-stage trellis (no additional detour hops); with 16 time slots, the 16-stage
trellis provides up to 4% higher success rate than the 12-stage trellis and up to 7.5%
higher success rate than the 10-stage trellis. For the 8x8 mesh with 8 time slots, the
20-stage trellis (6 more detour hops allowed) provides up to 10% higher success rate
than the 14-stage trellis (no additional detour hops); with 16 time slots, the 20-stage
trellis also provides up to 10% higher success rate than the 14-stage trellis.

Figure 3.27: Success rate under different hops of allowed detours in 6x6 network with
8 or 16 slot table size. (x-axis: background traffic load; y-axis: success rate;
curves: 0, 2, 4 and 6 detour hops with 8 and 16 slots.)

Figure 3.28: Success rate under different hops of allowed detours in 8x8 network with
8 or 16 slot table size. (x-axis: background traffic load; y-axis: success rate;
curves: 0, 3 and 6 detour hops with 8 and 16 slots.)

Figure 3.29: 2x2 2D-mesh example NoC (nodes 0 and 1 in the top row, nodes 2 and 3 in
the bottom row).
3.5 Register-Exchange Trellis Search
In the previous forward-backtrack trellis search, the path search is divided into two
steps: forward search and backtrack. In this section, we present the register-exchange
trellis, which saves the entire survivor path during the forward search, so that when
the destination node is reached, the entire survivor path can be read out from the
destination node directly [Kam07, Fet95]. Consequently, the backtrack is omitted and
the search time is halved compared to the previous forward-backtrack approach, which
can also improve the allocation success rate.
3.5.1 Register-Exchange Trellis Path Search Algorithm
In Register-Exchange (RE) trellis search, each state forwards the entire survivor path
(i.e. the path from the initial state to the current state) to its neighbors at the
next stage. The information sequences of the survivor paths are thus continuously
updated during the path search. As soon as the target state is reached, the entire
survivor path sequence can be read directly from the target state. The survivor path is
the shortest contention-free path between the initial state and the target state, which
is the path to be found. The RE trellis can also adopt the unfolded, folded and
bidirectional structures. In this section, we take the unfolded RE trellis and the
folded RE trellis as examples.
3.5.1.1 Unfolded Register-Exchange Trellis Path Search
The unfolded trellis graph of the example network in Fig. 3.29 is illustrated in
Fig. 3.30. The search begins at the source node at the initial stage and traverses the
trellis, trying to activate all of its connected neighbors. A neighbor can be activated
only if the associated incoming branch is available. A newly activated node continues
the propagation to activate its own neighbors and forwards the associated survivor path
sequence to them. The search stops when the last stage is reached.

Figure 3.30: Unfolded RE trellis path search. The survivor path is read directly from
destination node without backtrack. (Stages correspond to the time slot indices
n mod S, (n+1) mod S and (n+2) mod S; the survivor path shown is 023.)

Figure 3.31: The folded RE trellis search graph of the example NoC.
Figure 3.32: Block diagram of single state (detect unit, select unit and survivor
register between input and output).

In Fig. 3.30, assume node 0 is the source node and node 3 is the destination node. The
search starts from the source and activates its neighbors (node 2 and node 1), and the
activated neighbors update their survivor path sequences (node 2 to {02} and node 1 to
{01}). The active neighbors continue the propagation, e.g. node 2 activates node 3, and
node 3 then updates its survivor path sequence to {023}. Simultaneously, node 1 also
tries to reach node 3. Assume the branch N1 → N3 has already been allocated by a
previous allocation; then node 3 cannot be activated by node 1 at this time. When the
target node (node 3) is reached, the survivor path is read directly from the target
node as N0 → N2 → N3.
Assume the beginning slot at the source is t; then the slot sequence along the path is
{t, (t + 1) mod S, (t + 2) mod S}.
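The walk-through of the 2x2 example can be mirrored in software. The following is a minimal behavioural sketch, not the hardware implementation: the function `re_trellis_search`, the adjacency dictionary, and the representation of allocated branches as `(from_node, to_node, slot)` tuples are all assumptions made for illustration. When several incoming branches are available, the sketch simply keeps the first one it encounters as the survivor.

```python
def re_trellis_search(neighbors, occupied, src, dst, start_slot, num_slots, max_stages):
    """Register-exchange search: each active state carries its survivor path."""
    survivors = {src: [src]}  # active node -> survivor path stored in its register
    for stage in range(max_stages):
        slot = (start_slot + stage) % num_slots  # slot shifts by one per stage
        next_survivors = {}
        for node in sorted(survivors):
            for nb in neighbors[node]:
                if (node, nb, slot) in occupied:
                    continue  # branch already allocated in this slot
                if nb not in next_survivors:
                    # Exactly one incoming branch becomes the survivor.
                    next_survivors[nb] = survivors[node] + [nb]
        if dst in next_survivors:
            path = next_survivors[dst]
            slots = [(start_slot + i) % num_slots for i in range(len(path))]
            return path, slots  # read out directly, no backtrack needed
        survivors = next_survivors
    return None  # destination not reached within the allowed stages

# 2x2 mesh: 0 and 1 in the top row, 2 and 3 in the bottom row.
mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
# Branch N1 -> N3 is already allocated in slot (t + 1) mod S, with t = 0 and S = 4.
result = re_trellis_search(mesh, {(1, 3, 1)}, src=0, dst=3,
                           start_slot=0, num_slots=4, max_stages=2)
print(result)  # ([0, 2, 3], [0, 1, 2])
```

With the branch N1 → N3 blocked, the destination register holds the path 0, 2, 3 and the slot sequence {t, t + 1, t + 2} for t = 0, matching the example above.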
3.5.1.2 Folded Register-Exchange Trellis Path Search
The proposed unfolded trellis graph can be efficiently mapped onto a folded structure.
To enable iterative traversals that represent a multi-hop search, there is an
'iteration link' from the second stage back to the first stage. A register is assigned
to each state to store the survivor path. As soon as the target state is reached, the
corresponding survivor path is the one stored in its survivor path register. Each
iteration consumes one clock cycle. The folded structure reduces the resource cost
significantly at the expense of a longer search time.
The folded path search for the example network in Fig. 3.29 is illustrated in
Fig. 3.31. The search begins at the source node of the initial stage and traverses the
trellis, trying to activate its connected neighbors. When a node is active, its
associated register stores the entire survivor path sequence. In the next cycle, the
active node travels back to the first stage and propagates the search again. The search
stops in two cases: i) the target node has been reached, or ii) a certain number of
iterations has elapsed (by default, 2N − 2 iterations for an NxN mesh NoC). Hence,
livelock is avoided. As shown in Fig. 3.31, the search signal starts from the source
and activates node 2, and node 2 updates its survivor path register to {02}. In the
next cycle, node 2 activates node 3, and node 3 updates its survivor path register to
{023}. When node 3 is reached, the path is read from its survivor register as
N0 → N2 → N3.
3.5.2 Trellis Path Search Implementation
The implementation of a single state of an RE trellis can be divided into three basic
units, as shown in Fig. 3.32. The detect unit uses the input data to detect which
incoming branches are active. These are then fed to the select unit, which selects one
of the incoming branches as the survivor path and accumulates the survivor path.
Thereafter, the survivor register saves the accumulated survivor path and outputs it to
the neighbors at the next stage.

Figure 3.33: Implementation schematic of the folded RE trellis
The implementation schematic of Fig. 3.31 for the example 2x2 mesh NoC is shown in
Fig. 3.33. The survivor path scales with the number of stages and nodes, and the
associated storage is necessary. Each state has a register (of size
$(2N-2) \cdot \log_2 N^2$ bits for an NxN mesh) to store the whole survivor path. There
is also a 'branch state' register for each branch, which stores the current allocation
state of the branch. When a branch is allocated at a specific slot, its corresponding
state register is set to '0', thereby excluding it from future searches.
Correspondingly, the state register is set to '1' when the branch is released. For each
state, one and only one of the available incoming branches is selected as the survivor
path and updated in the survivor register. An incoming branch is detected as available
if the search signal from the source node reaches this branch and ('AND' operation) its
corresponding 'branch state' register is valid. The time slot shift (the slot shifts
from t to (t + 1) mod S after each stage) is realized by wire connection. When the
target node is reached, the entire path sequence is read directly from its associated
survivor register.
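To give a feel for the storage and detection logic described here, the sketch below evaluates the survivor register size and the branch availability condition. Both function names are hypothetical, and rounding log2 N² up to a whole number of bits for mesh sizes that are not powers of two is an assumption that the text does not state.

```python
def survivor_register_bits(mesh_n):
    """Bits per survivor register: (2N - 2) stages times log2(N^2) bits per node ID."""
    stages = 2 * mesh_n - 2
    # Integer ceil(log2(N^2)); rounding up is an assumption for non-power-of-two N.
    node_id_bits = (mesh_n * mesh_n - 1).bit_length()
    return stages * node_id_bits

def branch_available(search_signal, branch_state):
    """'AND' of the incoming search signal and the branch state register.

    branch_state is 1 when the branch is free and 0 when it is allocated.
    """
    return bool(search_signal and branch_state)

for n in (2, 4, 8):
    print(f"{n}x{n} mesh: {survivor_register_bits(n)} bits per state")
```

The register width grows with both the number of stages and the node ID width, which is why the storage cost of the RE trellis rises quickly with the mesh size.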
Figure 3.34: Area of folded FB and folded RE NoCManagers in different size NoCs with
different slot table sizes. (x-axis: #routers; y-axis: area in 10^4 µm²; curves: folded
FB TESSA and folded RE-TESSA with 4 and 8 slots.)
3.5.3 Performance Evaluation
In this section, we present the synthesis and simulation results of the folded RE
trellis against the folded FB trellis and against distributed parallel probe search
[LJL14b].
3.5.3.1 Synthesis Results
The NoCManager is available in synthesizable Verilog HDL and can be generated from an
XML description for different NoC sizes. Using Synopsys Design Compiler, the NoCM was
synthesized in TSMC 65 nm technology for mesh networks of size 4x4 to 9x9. Both the FB
and RE folded TESSA NoCMs were synthesized with a 1 GHz clock frequency constraint.⁶
From the area consumption shown in Fig. 3.34, we can see that the area of RE TESSA
grows with O(S·M) in a 2D-mesh (M = #routers, S = #slots). Since RE TESSA has to use a
larger register file (of size $N \cdot (2N-2) \cdot \log_2 N^2$ bits), its area is
about twice that of the folded FB TESSA. The average AT complexity in Fig. 3.35 shows
that when the NoC is small (smaller than a 7x7 mesh), the AT complexity of RE and FB
TESSA is similar. However, as the network size grows, the AT complexity of FB TESSA
becomes better than that of the RE approach. The average energy consumption per
allocation in Fig. 3.36 shows that RE TESSA consumes less energy than FB TESSA in small
NoCs (smaller than 8x8) but more energy in large NoCs. The reason is that RE TESSA
takes less search time because there is no backtrack, so it can offer better energy
efficiency than FB TESSA. But when the NoC size increases, the survivor registers in RE
TESSA grow dramatically, so the energy consumption of RE TESSA increases faster than
that of FB TESSA. Beyond a certain point, the energy consumption of RE TESSA can be
higher than that of FB TESSA. However, owing to the partitioning architecture, which
divides a large system into multiple small logic partitions with multiple managers
(explained in section 3.7), each NoCM only manages a limited number of nodes in its
local region, and thus the energy consumption of RE TESSA can be better than that of
the FB approach (if each partition is smaller than 8x8).

⁶ Since in RE TESSA there is more data to be forwarded to the next stage, for large
NoCs the clock frequency of RE TESSA could be lower than that of FB TESSA.

Figure 3.35: Average AT complexity per allocation of RE and FB NoCManagers in different
size NoCs with different slot table sizes. (x-axis: #routers; y-axis: AT per allocation
in 10^4 µm²/GHz; curves: folded FB TESSA and folded RE-TESSA with 4 and 8 slots.)

Figure 3.36: Average energy consumption per allocation of RE and FB NoCManagers in
different size NoCs with different slot table sizes. (x-axis: #routers; y-axis: energy
consumption per allocation in picojoules; curves: folded FB TESSA and folded RE-TESSA
with 4 and 8 slots.)

Figure 3.37: Allocation speed comparison between RE TESSA and FB TESSA. (x-axis: #hops;
y-axis: allocation time in cycles; curves: FB TESSA and RE TESSA in an 8x8 mesh.)
3.5.3.2 Simulation Results
The allocation speed and success rate of the RE TESSA NoCManagers are compared to FB
TESSA and distributed parallel probe search [LJL14b] under uniform random traffic. The
request queues of the FB and RE NoCMs are both 64 flits deep; larger queue sizes did
not show significant performance improvements in our continuous simulations. We also
re-implemented a distributed parallel probe connection setup approach according to
Liu's work [LJL14b] for comparison, with a retry deadline of 300 clock cycles attached
to each request. Each node has a buffer to store incoming requests, with the buffer
size equal to the slot table size.
The performance metrics success rate, allocation time and GS offered load are explained
in the previous section.
Each data point shown in the figures comes from a simulation of 1 million cycles. The
NoC issues GS connection requests to the NoCM following uniform random traffic that
meets the given GS offered load. The first and last 100,000 simulation cycles are not
considered in order to exclude transient effects. The connection lifetime, i.e. the
number of data flits transmitted over an established connection, is set to 100, 200,
300 and 500 flits.
Figure 3.38: Success rate compared to FB TESSA and probe search in 6x6 network with
slot table size of 16 and 8. Each connection delivers 100 or 200 flits. (x-axis: GS
offered load; y-axis: success rate.)
Comparison of allocation speed. Since RE TESSA omits the backtrack step that is
necessary in the FB approach, its allocation speed is twice that of FB TESSA. RE TESSA
needs one cycle to traverse a single hop through the network, while the FB approach
requires two cycles, i.e. one cycle for the forward search and one cycle for the
backtrack. Since RE TESSA has to forward the entire survivor path during the forward
search, it may have a longer critical path than the FB approach. However, up to an 8x8
mesh, the critical path of RE TESSA can be constrained to 1 ns, the same as in the FB
approach. Owing to the partitioning architecture, each NoCM only manages a limited
number of nodes in its local region, and thus the critical path of RE TESSA would not
increase too much.
As shown in Fig. 3.37, to traverse n hops, RE TESSA needs only n cycles in the NoCM to
find the path, whereas FB TESSA needs 2 · n cycles. Hence, the allocation speed of RE
TESSA is double that of the FB approach.
Comparison of Success Rate. In distributed parallel probe search, when several
connections are requested simultaneously, the concurrent search flits might block each
other. Moreover, it only searches minimal paths and cannot detour as TESSA can. For
example, in Fig. 3.38 for the 6x6 NoC, RE TESSA offers up to 25% higher success rate
than distributed parallel probe search. Compared to FB TESSA, when each connection
delivers 200 flits, the success rates of FB and RE TESSA are similar, while when each
connection delivers 100 flits, the RE approach achieves up to 22% higher success rate.
The reason is that, for the same offered load, more new requests reach the NoCM when
each connection delivers 100 flits than when it delivers 200 flits. Therefore, many
requests arriving at the slower FB NoCM must be rejected immediately because of a full
request queue, although there might be a free path for them. In Fig. 3.39 for the 8x8
NoC, RE TESSA offers up to 33% higher success rate than distributed parallel probe
search. When each connection delivers 300 flits, the RE approach provides up to 22%
higher success rate than FB TESSA.

Figure 3.39: Success rate compared to FB TESSA and probe search in 8x8 network with
slot table size of 16. Each connection delivers 300 or 500 flits. (x-axis: GS offered
load; y-axis: success rate.)
Comparison of Area.Time/Success Rate In the previous section, we have shown the AT complexity. However, because failed allocations have to be dropped, the effective time per successful allocation increases according to $T_{\mathrm{eff}} = T/S$, where $S$ is the success rate. Therefore, the effective area-time product is $AT_{\mathrm{eff}} = AT/S$. This can be further decomposed into true cost and overhead:
$$AT_{\mathrm{eff}} = AT\left(1 + \frac{E}{S}\right) = AT + AT\,\frac{E}{S}$$
Note:
$$\frac{1}{S} = \frac{S + E}{S} = 1 + \frac{E}{S}$$
56 3 Centralized Connection Allocation for TDM CS NoCs
Figure 3.40: Average Area.Time/Success Rate (per allocation) in a 6x6 network with slot table sizes of 16 and 8. Each connection delivers 100 or 200 flits. [Plot: AT/S per allocation ($10^4\,\mu\mathrm{m}^2$/GHz) versus GS offered load; curves for FB TESSA and RE TESSA with 200 flits and 16 slots, 100 flits and 16 slots, and 100 flits and 8 slots.]
Here, $E$ is the error rate. Hence, the overhead scales with $E/S$, and this ratio is specific to each algorithm and implementation.
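The decomposition of the effective area-time product can be checked numerically. The following is a minimal sketch, assuming $E = 1 - S$ as implied by the note on $1/S$; the function name and the input values (AT = 200, S = 0.8) are purely illustrative and are not taken from the measurements.

```python
def effective_at(area_time, success_rate):
    """Return (AT_eff, true_cost, overhead) for an AT product and a success rate S.

    Assumes E = 1 - S, so that AT_eff = AT / S = AT + AT * E / S.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success rate must be in (0, 1]")
    error_rate = 1.0 - success_rate
    overhead = area_time * error_rate / success_rate   # AT * E / S
    return area_time / success_rate, area_time, overhead


# Illustrative values only, not measured data.
at_eff, true_cost, overhead = effective_at(area_time=200.0, success_rate=0.8)
print(at_eff, true_cost, overhead)
```

The sum of the true cost and the overhead reproduces AT/S, so a lower success rate shows up directly as a larger overhead term.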
In this section, we show the measured AT/S (Area.Time/Success Rate) of FB and RE TESSA in a 6x6 mesh, as shown in Fig. 3.40. The results show that as the offered load increases, the success rate drops, and thus the value of AT/S rises. When each connection delivers 200 flits, the AT/S of FB TESSA is better than that of RE TESSA. When each connection delivers 100 flits, the AT/S of RE TESSA becomes better than that of FB TESSA once the offered load exceeds 0.9.
3.6 Single Layer Trellis
In the previous sections, every slot at the initial stage has its own layer of the trellis graph, i.e. if the size of the slot table is S, there are S layers of the trellis graph in parallel, as in Fig. 3.42. We call this the multi-layer approach. In this section, we present the single-layer approach, in which only one layer of the trellis graph needs to be implemented. Moreover, all slots at
Figure 3.41: 2x2 example NoC. [Nodes 0 and 1 in the top row, nodes 2 and 3 in the bottom row.]
Figure 3.42: Multiple-layer trellis of the example NoC. Each slot at the initial stage has its own layer. [Schematic: S layers, one per initial slot 0 to S-1; stages n, n+1 and n+2 correspond to the time slot indices (n+t) mod S, (n+1+t) mod S and (n+2+t) mod S.]
the initial node (source node) share this single layer and can search in it simultaneously. Compared to the previous approaches, in which multiple layers of the trellis graph have to be implemented, the hardware consumption is reduced dramatically.
3.6.1 Single-layer Trellis Path Search Algorithm
The advantage of the multi-layer approach is that every slot from the source node can search its own path in parallel, so several slots can be allocated to a single flow simultaneously. However, in some scenarios most flows may need only one portion of the link
Figure 3.43: General schematic structure of the single-layer approach. The slot table size is S, so there are S-1 additional stages. The first S stages are associated with the S time slots, which can launch S initial searches simultaneously at the source node.
bandwidth, i.e. one time slot, so allocating several slots in parallel to a single flow may be unnecessary. Consequently, we do not always need to implement S layers of the trellis graph; in such cases, a single layer of the trellis, shared by all slots, is sufficient. We call this the single-layer approach.
The single-layer approach is shown in Fig. 3.43. There is only one layer of the trellis graph, which is shared by all slots. The trellis now consists of S − 1 + 2N − 2 stages for an NxN mesh NoC, i.e. it is S − 1 stages longer than the default unfolded trellis. The first S stages of the trellis are used to launch S initial searches at the source node: each of these stages is associated with one time slot and can launch a search. Hence, all slots at the source node can search simultaneously in the single layer.
An example of the single-layer approach is shown in Fig. 3.44 for the example 2x2 mesh network. Node 0 is the source and node 3 is the destination. The size of the slot table is 2, so the trellis has one more stage than the default unfolded trellis. At the first stage, slot 0 from the source starts a search and activates node 2. At the second stage, slot 1 from the source also sends out a search and activates node 1. The searches are propagated forward until they reach the last stage. At the last stage, the destination is activated by both node 1 and node 2, and node 1 is selected by the destination as the predecessor. In the next cycle, the backtrack starts and selects the survivor path N0 → N1 → N3, and the corresponding slot sequence along the path is obtained as {1, 0, 1}.
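The forward search and backtrack in a single shared layer can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware implementation: the mapping of stage t to slot t mod S, the choice of the first active branch as survivor, the early return as soon as the destination is activated, and the link availability in the usage example are all assumptions made for illustration. The function names are hypothetical.

```python
def mesh_links(n):
    """Directed links of an n x n mesh with row-major node numbering."""
    links = []
    for r in range(n):
        for c in range(n):
            u = r * n + c
            if c + 1 < n:                      # horizontal neighbor
                links += [(u, u + 1), (u + 1, u)]
            if r + 1 < n:                      # vertical neighbor
                links += [(u, u + n), (u + n, u)]
    return links


def backtrack(survivor, t, node, num_slots):
    """Follow survivor pointers from the destination back to a launch."""
    path, slots = [node], []
    while survivor[t][node] is not None:
        node = survivor[t][node]
        t -= 1
        slots.append(t % num_slots)            # slot of the link just crossed
        path.append(node)
    return path[::-1], slots[::-1]


def single_layer_search(n, num_slots, free, src, dst):
    """Forward search and backtrack in one shared trellis layer.

    free[(u, v)][s] is True when link u->v is unallocated in slot s.
    Returns (path, link_slots), or None when no path is found.
    """
    links = mesh_links(n)
    num_transitions = (num_slots - 1) + (2 * n - 2)
    # survivor[t][v] = predecessor of v in column t (None marks a launch)
    survivor = [dict() for _ in range(num_transitions + 1)]
    for t in range(num_transitions + 1):
        if t < num_slots:
            survivor[t][src] = None            # initial search for slot t
        if dst in survivor[t]:
            return backtrack(survivor, t, dst, num_slots)
        if t == num_transitions:
            break
        slot = t % num_slots                   # slot tied to this stage
        for u, v in links:
            if u in survivor[t] and free[(u, v)][slot]:
                survivor[t + 1].setdefault(v, u)   # keep first active branch
    return None


# Usage on the 2x2 example with S = 2 (link states are hypothetical).
S = 2
free = {link: [True] * S for link in mesh_links(2)}
free[(0, 1)][0] = False    # assumed: link 0->1 already allocated in slot 0
free[(2, 3)][1] = False    # assumed: link 2->3 already allocated in slot 1
result = single_layer_search(2, S, free, src=0, dst=3)
print(result)
```

With these assumed link states the search launched for slot 0 is blocked at node 2, while the search launched for slot 1 reaches the destination via node 1, which mirrors the situation described for Fig. 3.44. The model reports one slot per link rather than one per node.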
Figure 3.44: Single-layer path search example. [Trellis from source (Src) to destination (Des) over time slot indices 0 and 1, with initial searches at the first stages; legend: forward search, backtrack, and link unavailable (search failed).]
3.6.2 Single-layer Trellis Path Search Implementation
The implementation schematic of a single state of the single-layer trellis is shown in Fig. 3.45. Each state is implemented as a Detect-Select (DS) unit, which evaluates the incoming search signals to determine whether this state can be activated, and forwards the active search to the next stage. The actual branch state (the link allocation state at a specific slot) is stored in the ‘Branch State’ register. When a branch is allocated, its associated state register is set to ‘0’ to exclude it from later allocation. Correspondingly, the state register is reset to ‘1’ to allow allocation again when the branch is released.
The workflow of the DS unit is the same for every state and can be divided into two steps:
1. Detect the active incoming branches: a branch becomes active only when both its branch state register and the corresponding incoming search signal are valid (‘AND’ operation). A node becomes active as soon as any incoming branch is active (‘OR’ operation).
2. Select one active branch as survivor path: exactly one of the active incoming branches is selected as the ‘survivor path’ and saved in the ‘Survivor Path Register’. At the same time, the active search is forwarded to the next stage.
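The two steps of the DS unit can be mimicked in software as follows. This is a behavioral sketch only; the function name is hypothetical, and the fixed-priority choice of the lowest-indexed active branch is an assumption, since the text only requires that exactly one active branch is selected.

```python
def detect_select(search_in, branch_free):
    """Behavioral model of one Detect-Select (DS) unit.

    search_in[i]   : search signal arriving on incoming branch i
    branch_free[i] : branch state register of branch i (True = '1' = free)
    Returns (node_active, survivor_index); survivor_index is None if inactive.
    """
    # Step 1: detect. AND per branch, then OR over all branches.
    active = [s and f for s, f in zip(search_in, branch_free)]
    node_active = any(active)
    # Step 2: select. Lowest index wins (assumed fixed-priority policy).
    survivor = active.index(True) if node_active else None
    return node_active, survivor


# Node with two incoming branches; branch 0 is already allocated.
print(detect_select([True, True], [False, True]))
```

The returned `node_active` flag corresponds to the search signal forwarded to the next stage, and `survivor_index` to the content of the survivor path register.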
3.6.3 Synthesis Results
Using Synopsys Design Compiler, the NoCM was synthesized with TSMC 65 nm technol-
ogy, for diﬀerent mesh networks of size 4x4 to 10x10. From the area consumption shown
Figure 3.45: The implementation schematic of node 3 at stage 1, so the associated time slot is slot 1. [Schematic: the DS unit of N3 combines the search signals from N1 and N2 with the slot 1 branch state registers of N1 → N3 and N2 → N3 via AND gates, merges the results with an OR gate, and stores the selection in the survivor path register.]
Figure 3.46: Area of unfolded single-layer and multi-layer NoCManagers in NoCs of different sizes with different slot table sizes. [Plot: area ($10^4\,\mu\mathrm{m}^2$) versus #routers; curves for 4, 8 and 16 slots, each for the multi-layer and the single-layer approach.]
in Fig. 3.46, we can see that the area of the single-layer NoCM grows with O(N+S) in a 2D mesh (N = #routers, S = #slots). Because it implements only a single layer rather than S layers as in the multi-layer approach, its hardware resource consumption is dramatically lower. In an NxN mesh with a slot table size of S, each layer of the multi-layer approach has 2N−2 stages and there are S layers, so the total number of trellis stages is (2N−2) · S; for the single-layer approach, the number of trellis stages is S−1+2N−2 = S+2N−3.
Figure 3.47: Average energy consumption per allocation of unfolded single-layer and multi-layer NoCManagers in NoCs of different sizes with different slot table sizes. [Plot: energy per allocation (picojoule) versus #routers; curves for 4, 8 and 16 slots, each for the multi-layer and the single-layer approach.]
Hence, with a large slot table and a large NoC, the hardware efficiency of the single-layer approach is remarkably better than that of the multi-layer approach.
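The two stage-count formulas can be compared directly. The sketch below simply evaluates (2N−2) · S and S+2N−3 for an NxN mesh; the function names are illustrative, and the stage count is only a rough proxy that does not reproduce the synthesized area figures.

```python
def multi_layer_stages(n, num_slots):
    """Total trellis stages of the multi-layer approach in an n x n mesh."""
    return (2 * n - 2) * num_slots


def single_layer_stages(n, num_slots):
    """Trellis stages of the single-layer approach in an n x n mesh."""
    return (num_slots - 1) + (2 * n - 2)


# 10x10 mesh with the slot table sizes used in the synthesis.
for s in (4, 8, 16):
    print(s, multi_layer_stages(10, s), single_layer_stages(10, s))
```

The multi-layer count grows with the product of mesh dimension and slot table size, whereas the single-layer count grows only with their sum.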
As Fig. 3.46 shows, for a 10x10 mesh, the single-layer NoCM reduces the area compared to the multi-layer approach by a factor of 2.5 with 4 slots, by a factor of 3.8 with 8 slots, and by a factor of 5.1 with 16 slots. The plot of average energy consumption per allocation in Fig. 3.47 shows that the single-layer approach also consumes much less energy than the multi-layer approach. For a 10x10 mesh, the single-layer NoCM reduces the energy consumption by a factor of 1.7 with 4 slots, 2.8 with 8 slots, and 3.4 with 16 slots.
3.7 Partitioned Trellis Architecture
The centralized methods for connection allocation in circuit-switched NoCs may pose serious performance and scalability issues in large-scale networks due to
1. the limited path search speed,
2. the increasing allocation request rate at the central unit,
3. and the increasing communication cost between the central unit and the NoC nodes.
Figure 3.48: The original NoC is divided into 4 partitions (A to D) with 4 dedicated NoCMs (NoCM_A to NoCM_D). Green arrow: forward search; purple arrow: backtrack; red arrow: communication among NoCMs. The border nodes A and E are backtracked as survivor path. The cross-partition search is along NoCM_A → (NoCM_B, NoCM_C) → NoCM_D. The path search inside each partition is done as forward-backtrack trellis search, while the cross-partition search among NoCMs is done as probe search.
We tackle this problem by proposing a partitioned architecture that divides the original system into multiple partitions, each with its own local manager. Each local manager stores only the information of the nodes in its region and is responsible for searching paths in that region. The local managers can work simultaneously, and they need to communicate with each other only when a connection request crosses partitions, i.e. when the source node and the destination node are not in the same partition. Since the managers work simultaneously, the computation capacity is increased. As the NoC nodes communicate only with their local managers, the communication overhead is mitigated. In this section, we employ the folded forward-backtrack multi-layer approach to search paths inside the partitions.
3.7.1 Partitioned TESSA search algorithm
Owing to their global knowledge of the system, the previous centralized approaches for connection allocation provide good performance for small and moderate NoCs. However, as the size of the NoC grows, more allocation requests are received by the central unit per cycle, and the communication cost between the central unit and the NoC nodes becomes higher. Consequently, the central unit can become a serious bottleneck, especially under hotspot traffic.
In this section, we propose a scalable mechanism to address the dynamic connection allocation problem in large systems. The partitioned architecture (i.e. a spatial partitioning technique) is used to overcome the scalability problem of traditional centralized systems. The NoC is divided into small non-overlapping logical partitions served by local NoCMs. This partitioning keeps the request load of each manager and the manager-node communication overhead moderate. Inside each partition, the path search problem is solved by a local manager with the trellis-search algorithm. To establish a path that crosses partitions, the managers communicate with each other in a distributed manner to converge on the global path. Hence, good scalability and high performance can be achieved at the same time.
The NoCMs are connected with each other in a 2D-mesh topology via dedicated links. The dedicated link width is $2 \cdot \log_2 M + \log_2(N \cdot N) + N \cdot S + 2$ bits, where $M$ is the number of partitions (for the destination and source NoCM), $N \cdot N$ is the partition size (for the destination node and border nodes), $S$ is the slot table size, and 2 bits are used for control signals. The link width can be reduced, but then more cycles are needed to send a single message. The dedicated links among NoCMs are longer than normal links in the NoC, but the authors of [WPG10] claim that it is possible to route these links on high metal layers with reduced RC delay, and that new techniques like high-speed serialized, LVDS on-chip signaling will be available to allow long on-chip links without frequency degradation. These NoCMs can work in parallel, and they need to exchange information only when requested connections cross partitions.
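The link width expression can be evaluated for concrete configurations. The sketch below is illustrative only: the function name is hypothetical, and rounding each logarithm up to a whole number of bits is an assumption for cases where the argument is not a power of two.

```python
def ceil_log2(x):
    """Number of bits needed to address x distinct items (x >= 1)."""
    return (x - 1).bit_length()


def nocm_link_width(num_partitions, partition_dim, num_slots):
    """Width in bits of a dedicated NoCM-to-NoCM link.

    Implements 2*log2(M) + log2(N*N) + N*S + 2 with M partitions,
    N x N nodes per partition and S slots; logarithms are rounded up
    (an assumption, not stated in the text).
    """
    nocm_addr = 2 * ceil_log2(num_partitions)          # source + destination NoCM
    node_addr = ceil_log2(partition_dim * partition_dim)
    payload = partition_dim * num_slots
    control = 2
    return nocm_addr + node_addr + payload + control


# 18x18 mesh split into four 9x9 partitions, 8 slots.
print(nocm_link_width(num_partitions=4, partition_dim=9, num_slots=8))
```

The N · S term dominates for the configurations considered here, which is why larger partitions or larger slot tables widen the link most.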
If the source and destination nodes of a request are both in the same partition, the request is handled by the local manager only. Otherwise, as illustrated in Fig. 3.48, the local NoCM (NoCM_A) first starts the path search in its trellis graph from the source node to reach the border nodes (nodes A, B and C), and then forwards the search message to its neighbor NoCMs (NoCM_B and NoCM_C); this continues until the destination NoCM (NoCM_D) is reached. When the destination node is reached, the backtrack starts from the destination to select the survivor path. Note that multiple nodes on the border can be activated and set as start nodes for the next trellis search. This is possible because trellis search supports multiple start nodes and multiple destination nodes in one search without additional cost.
The header of the search message among NoCMs contains the address of the destination node and of the source and destination NoCMs. The payload contains the activated border nodes in the forward search, and the selected border node in the backtrack.
Figure 3.49: The probe search among NoCMs. Each node represents a NoCM. (a) In each node a probe may double. (b) When two probes meet, one is canceled.
Table 3.2: The usage of control signals
Signals Usage
00 Idle
01 Search comes in
10 Nack (Path search failed)
11 Ack (Path established)
The basic search idea among NoCMs is similar to parallel probe search [LJL14b], as shown in Fig. 3.49. The source NoCM sends searches to all of its productive neighbor NoCMs (i.e. those that lead closer to the destination). Each reached neighbor performs a trellis search in its partition and then forwards the message to its own reached productive neighboring NoCMs. Consequently, the search is forwarded to the destination along all possible minimal paths. If two searches meet at one NoCM, one of them is canceled based on round-robin arbitration. If the downstream NoCM is not available, that probe search is canceled immediately without waiting. All channels set up only by the canceled probe are released hop by hop.
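The notion of a productive neighbor in the 2D mesh of NoCMs can be made concrete with a few lines of code. This is a sketch under the assumption that NoCMs are addressed by (x, y) coordinates in the manager mesh; the function name and the coordinate convention are illustrative.

```python
def productive_neighbors(cur, dst):
    """Neighboring NoCMs of `cur` that lie on a minimal path towards `dst`.

    cur, dst: (x, y) coordinates in the manager mesh (assumed convention).
    At most two neighbors are productive, one per dimension.
    """
    (cx, cy), (dx, dy) = cur, dst
    result = []
    if dx != cx:                                   # step in x towards dst
        result.append((cx + (1 if dx > cx else -1), cy))
    if dy != cy:                                   # step in y towards dst
        result.append((cx, cy + (1 if dy > cy else -1)))
    return result


# Source NoCM at (0, 0), destination NoCM at (1, 1): the probe doubles.
print(productive_neighbors((0, 0), (1, 1)))
```

Because only steps that reduce the distance to the destination are returned, forwarding along these neighbors explores exactly the minimal paths among NoCMs.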
3.7.2 NoCM architecture
The block diagram of the NoCM is shown in Fig. 3.50. It comprises the trellis graph and the control modules (routing module and control logic). The details are explained in the following sections.
3.7.2.1 Control signals
Two control wires (2 bits) encode the 4 control signals used for probe search among NoCMs, as listed in Table 3.2.
Figure 3.50: Block diagram of the NoCManager. [Blocks: trellis graph (forward search, path backtrack, edge free/allocated state), routing (switch, arbiter, inPortSelect, target outport), control logic (search_cnt, control signal) and a DEMUX for messages from the NoC; interfaces: Search/Ans from and to other NoCMs, request and deallocation from the NoC, response message to the NoC.]
The ‘Search comes in’ signal indicates that a new probe search arrives, and Nack/Ack are backward answer signals (Ans). The initial ‘Search comes in’ is generated by the source NoCM and the initial Ack is generated by the destination NoCM. The detailed usage of the control signals during probe search is illustrated in Fig. 3.51. If a NoCM accepts a search, it becomes busy and rejects any later search until it becomes free again. Since each search may have two productive output directions and may therefore send out two searches, a counter (search_cnt) is used to record the number of searches sent out. The value of the counter is decreased whenever an Ans signal is received from a downstream NoCM. The NoCM sends an Ack to its upstream NoCM as soon as an Ack signal comes back from a downstream NoCM, while it sends out a Nack only when it has received all Ans signals (search_cnt=0) and they are all Nack. The busy NoCM becomes free again in two cases: i) it receives an Ack, or ii) it has received all Ans signals (search_cnt=0). The search procedure is detailed in Fig. 3.52.
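The busy flag and the search_cnt counter behave like a small state machine. The following is a behavioral sketch for an intermediate NoCM, using the 2-bit encoding of Table 3.2; the class and method names are hypothetical, and ignoring answers that arrive while the NoCM is free is an assumption.

```python
from enum import IntEnum


class Ctrl(IntEnum):
    """2-bit control signal encoding from Table 3.2."""
    IDLE = 0b00
    SEARCH = 0b01
    NACK = 0b10
    ACK = 0b11


class NoCMControl:
    """Busy/search_cnt bookkeeping of one intermediate NoCM (sketch)."""

    def __init__(self):
        self.busy = False
        self.search_cnt = 0

    def on_search(self, num_forwarded):
        """Accept a probe if free; num_forwarded searches are sent downstream."""
        if self.busy:
            return False                      # later searches are rejected
        self.busy = True
        self.search_cnt = num_forwarded
        return True

    def on_answer(self, ans):
        """Process an Ack/Nack from downstream; return the upstream signal."""
        if ans not in (Ctrl.ACK, Ctrl.NACK):
            raise ValueError("answer must be ACK or NACK")
        if not self.busy:
            return Ctrl.IDLE                  # late answer ignored (assumption)
        if ans == Ctrl.ACK:
            self.busy, self.search_cnt = False, 0
            return Ctrl.ACK                   # first Ack is passed upstream
        self.search_cnt -= 1
        if self.search_cnt == 0:
            self.busy = False
            return Ctrl.NACK                  # all answers were Nack
        return Ctrl.IDLE


# Sequence of Fig. 3.51: search with two probes, then Nack, then Ack.
nocm = NoCMControl()
nocm.on_search(2)
print(nocm.on_answer(Ctrl.NACK), nocm.on_answer(Ctrl.ACK))
```

The usage example follows the state sequence of Fig. 3.51: the counter goes from 2 to 1 on the Nack, and the Ack frees the NoCM and is forwarded upstream.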
3.7.2.2 Control trellis path search in each partition
When a NoCM is reached by a probe search, it starts the path search in its trellis graph to find the path inside its region. The forward trellis search stops in three cases: when it has reached the border to non-busy productive neighboring NoCMs, when it has reached the destination node (if the destination node is in this partition), or after a certain number of cycles
Figure 3.51: At the beginning, NoCM A is free. At state 1, a probe search comes in and NoCM A becomes busy. At state 2, a Nack comes. At state 3, an Ack comes. [Initial state: Busy=0, search_cnt=0. State 1: Busy=1, search_cnt=2. State 2: Busy=1, search_cnt=1. State 3: Busy=0, search_cnt=0, send upstream Ack.]
Figure 3.52: The search procedure of partitioned TESSA. [Flowchart for the local NoCM_0, a neighbor NoCM_1 and the destination NoCM_n: each runs a path search, forwards a probe if the destination is not yet reached, and evaluates the returning Ack/Nack answers with search_cnt before backtracking the path or reporting a failure.]
(i.e. 2N − 2 cycles for an NxN mesh partition). The path backtrack in the trellis starts in two cases: i) an Ack is sent back, or ii) the destination node is reached in this trellis graph. In the forward probe search, the message containing the reached border nodes is sent to the corresponding downstream NoCMs; in the backtrack, only the backtracked border node needs to be sent to the upstream NoCM. The connection inside each region is set up by its local NoCM.
3.7.2.3 Ensuring that the destination is activated only once by one request
One request can send out several searches along different paths to reach the destination, and different paths may take different amounts of time. In order to avoid allocating several overlapping paths, it must be guaranteed that the destination is activated only by the first search and not by any later search of the same request. This is guaranteed by a request ID. Each destination NoCM has a table that stores the ID of the last request from each source NoCM. The searches belonging to the same request are assigned the same ID. When the destination NoCM receives a search, the ID of the received search is compared to the ID of the last search from that source NoCM. If the two IDs are the same, the searches belong to the same request and the newly received one is rejected. Otherwise, the new search is accepted and the ID table is updated. Hence, at most one Ack is sent back for each request.
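The ID comparison at the destination NoCM amounts to a small lookup table. The sketch below is a software illustration only; the class and method names are hypothetical, and it assumes that consecutive requests from the same source NoCM carry different IDs.

```python
class RequestIdTable:
    """Per-source table of the last accepted request ID at a destination NoCM."""

    def __init__(self):
        self.last_id = {}                 # source NoCM -> last accepted ID

    def accept(self, src_nocm, request_id):
        """Return True for the first search of a request, False for duplicates."""
        if self.last_id.get(src_nocm) == request_id:
            return False                  # same request arrived via another path
        self.last_id[src_nocm] = request_id
        return True


table = RequestIdTable()
print(table.accept("NoCM_A", 7))   # first search of request 7
print(table.accept("NoCM_A", 7))   # second search of the same request
```

Only the first call for a given request returns True, so at most one Ack per request is generated by the destination.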
Since the forward probe search follows minimal paths and does not wait, livelock and deadlock are avoided.
3.7.3 Performance Evaluation
In this section, we present the synthesis and simulation results of the partitioned trellis against the non-partitioned trellis and distributed parallel probe search [LJL14b], and also provide an intuitive suggestion on how to partition the system.
3.7.3.1 Synthesis Results
The NoCManager is available in synthesizable Verilog HDL and can be generated from an XML description for different NoC sizes. Using Synopsys Design Compiler, the NoCM was synthesized with TSMC 65 nm technology. Both the non-partitioned and the partitioned NoCMs were synthesized with a 0.5 GHz clock frequency constraint. For partitioned TESSA, the 18x18 mesh is divided into 4 partitions (four 9x9 mesh partitions) or 9 partitions (nine 6x6 mesh partitions); the 16x16 and 20x20 meshes are divided into 4 partitions or 16 partitions.
The area consumption is illustrated in Fig. 3.53. It might be surprising that the total area of partitioned TESSA is smaller than that of non-partitioned TESSA. The reason is that in non-partitioned TESSA a single trellis graph contains the whole network, while in partitioned TESSA each trellis contains only the nodes of its local region. Hence, the non-partitioned trellis is much larger than those of the partitioned architecture, which requires more wiring effort and more flip-flops to distinguish different nodes. On the other hand, the more partitions partitioned TESSA has, the more control logic is required, so using more partitions may increase the logic area. Therefore, partitioned TESSA with 4 partitions costs the least area. The 4-partition TESSA can provide up to 21% lower area than
Figure 3.53: Total area of non-partitioned and partitioned NoCMs in NoCs of different sizes with different slot table sizes. The x-axis label ‘16, 4slot’ indicates a 16x16 mesh with 4 slots. [Bar chart: area ($10^4\,\mu\mathrm{m}^2$) for 16x16, 18x18 and 20x20 meshes with 4 and 8 slots; bars for non-partitioned TESSA, TESSA with 4 partitions, and TESSA with more than 4 partitions.]
non-partitioned TESSA (for the 20x20 mesh with 4 slots). The area of TESSA grows with O(S · M) in a 2D mesh (M = #routers, S = slot table size).
3.7.3.2 Simulation Results
The allocation speed and success rate of the partitioned NoCMs are compared to state-of-the-art centralized and distributed allocation techniques under uniform random traffic. For comparison, we re-implemented a distributed parallel probe connection setup approach according to Liu’s work [LJL14b], with a retry deadline attached to each request. The request queue sizes of the non-partitioned NoCM, a single NoCM of the 9-partition system and a single NoCM of the 4-partition system are 64, 7 and 16, respectively. The evaluated performance metrics (success rate, total allocation time and GS offered load) have been explained in the previous section. The results are explained in the following sections.
Every data point shown in the figures comes from a simulation of 1 million cycles. The NoC issues connection requests to the NoCM with uniform random traffic according to the required GS offered load. During simulation, half of the nodes are assumed to be masters and the other half slaves. The connection lifetime, i.e. the number of flits that each connection delivers, is set to 2000, 3000 or 4000 flits. The retry deadline of parallel probe search [LJL14b] is set to 3000 cycles.
Figure 3.54: Comparison of allocation speed of partitioned and non-partitioned TESSA and probe search in different networks with a slot table size of 8. [Plot: average total allocation time (cycles) versus GS offered load; curves for 4-partition TESSA in 16x16 (2000 flits), 9-partition TESSA in 18x18 (2000 and 4000 flits), non-partitioned TESSA in 16x16 (2000 flits) and 18x18 (2000 and 4000 flits), and probe search in 16x16 (2000 flits) and 18x18 (4000 flits).]
Comparison of allocation speed In parallel probe search, multiple trials might be needed before the search eventually succeeds, because only a single slot is investigated at a time. In TESSA, by contrast, all slots are searched simultaneously. As shown in Fig. 3.54, partitioned TESSA provides a much higher allocation speed. Compared to parallel probe search, it offers 7X (at offered load 0.1 in the 16x16 mesh) to 71X (at offered load 0.6 in the 18x18 mesh) higher allocation speed. Compared to non-partitioned TESSA with offered load from 0.25 to 0.6, it offers 48% (at offered load 0.25 in the 16x16 mesh with 2000 flits) to 72X (at offered load 0.6 in the 18x18 mesh with 2000 flits) higher speed. We can see that in non-partitioned TESSA with a low request rate, e.g. at lower offered load (from 0.1 to 0.2) or with a longer connection lifetime (4000 flits), the allocation time is reduced significantly. The reason is that at a high request rate the non-partitioned NoCM is too busy, so the incoming requests have to wait longer in the queue for their turn. In partitioned TESSA, however, multiple NoCMs work in parallel, so the requests are usually processed in time without waiting, and the allocation speed is therefore much higher. In partitioned TESSA, the allocation time is similar for different connection lifetimes (2000 flits or 4000 flits in the 18x18 mesh).
Comparison of Success Rate The simulation results for the success rate are illustrated in Fig. 3.55 and Fig. 3.56. For partitioned TESSA, the systems that are divided into 4 partitions achieve the best success rate in comparison to those with 9 or 16 partitions. When the GS offered load is high, partitioned TESSA provides a much higher success
Figure 3.55: Comparison of success rate of partitioned and non-partitioned TESSA and probe search in 16x16 and 20x20 networks with a slot table size of 8. [Plot: success rate versus GS offered load; curves for 16-partition, 4-partition and non-partitioned TESSA and probe search, in 16x16 with 2000 flits and in 20x20 with 3000 flits.]
rate than non-partitioned TESSA and parallel probe search. In parallel probe search, since a single slot is investigated at a time and the retry-before-deadline policy is employed, the search can be stopped as a failure when the deadline is reached, even though the remaining slots may provide an available path. Moreover, it searches only minimal paths and cannot detour as TESSA can. Therefore, compared to parallel probe search, the 4-partition TESSA can offer up to 55% higher success rate in the 16x16 mesh and up to 33% higher in the 18x18 mesh. In non-partitioned TESSA with high offered load, many incoming requests must be rejected immediately as failures because of the full request queue, even though there might be a free path for these requests. Hence, compared to non-partitioned TESSA, the 4-partition TESSA can provide up to 85% higher success rate in the 16x16 mesh, up to 84% higher in the 18x18 mesh, and up to 75% higher in the 20x20 mesh. However, when the offered load is very low, e.g. lower than 0.3 in the 16x16 and 18x18 meshes, the single NoCM can handle all the requests in time. Since the search among NoCMs in partitioned TESSA follows minimal paths, its path diversity is lower than that of non-partitioned TESSA, and thus in this case the success rate of non-partitioned TESSA is the best.
3.7.3.3 Suggestion on how to partition the system
From the simulation results we can see that the partitioned system does not always provide better performance than the non-partitioned system. If the request injection rate is not heavy, the performance of the partitioned system may even be worse than that of the non-partitioned
Figure 3.56: Comparison of success rate of partitioned and non-partitioned TESSA and probe search in an 18x18 network with a slot table size of 4 or 8. [Plot: success rate versus GS offered load; curves for 9-partition TESSA (2000 and 4000 flits, 4 and 8 slots), 4-partition TESSA (2000 and 4000 flits, 8 slots), non-partitioned TESSA (2000 and 4000 flits, 8 slots) and probe search (2000 flits, 8 slots).]
system. The request injection rate indicates the number of requests generated by each master per cycle. How, then, should the system be partitioned? In this section, we provide an intuitive suggestion to answer this question.
The system should be partitioned when, in a large NoC with a heavy request injection rate, so many requests arrive at the NoCM in a short time that the NoCM cannot process them in time, and some requests may even have to be discarded. Therefore, when the injection rate of connection requests exceeds the NoCM’s processing capacity, the system should be partitioned. We take folded FB TESSA as an example. Assume that half of the nodes in the NoC are masters and half are slaves. For an $N \cdot N$ mesh network, the average path length is $N - 1$ hops, so the average processing time for a single request is $2 \cdot (N-1)$ cycles. Hence, the corresponding boundary of the request injection rate that is within the NoCM’s processing capacity is $1 \div \left(2 \cdot (N-1) \cdot N \cdot N \cdot \frac{1}{2}\right) = \frac{1}{N^2 (N-1)}$ requests per cycle per master. If the request injection rate is higher than this boundary, the manager cannot process the incoming requests in time, and thus it is better to partition the system. The boundary of the request injection rate that ensures the NoCM can process the requests in time for different NoC sizes is shown in Fig. 3.57. Whenever the average request injection rate is above this curve, the system should be partitioned, and the partition size is chosen such that the local NoCM in each partition can process the incoming requests in time.
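The boundary can be evaluated for a few mesh sizes with a short script. This is a minimal sketch of the formula above, assuming folded FB TESSA, half of the nodes being masters, and an average path length of N − 1 hops; the function name is illustrative.

```python
def injection_rate_boundary(n):
    """Maximum request rate per master per cycle a single NoCM can sustain.

    n: dimension of the n x n mesh (n >= 2).
    Uses 2*(n-1) cycles per request and n*n/2 masters, i.e. 1 / (n^2 * (n-1)).
    """
    if n < 2:
        raise ValueError("mesh dimension must be at least 2")
    service_cycles = 2 * (n - 1)        # average processing time per request
    masters = n * n / 2                 # half of the nodes issue requests
    return 1.0 / (service_cycles * masters)


for n in (10, 16, 20):
    print(n, injection_rate_boundary(n))
```

The boundary falls with the cube of the mesh dimension, so the request rate a single manager can sustain per master shrinks quickly as the NoC grows.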
We should note that this is just a preliminary idea, which does not consider the communication cost among NoCMs. In reality, if the system is divided into too many partitions, the
Figure 3.57: The boundary of the request injection rate under the NoCM’s capacity in different NoC sizes. [Plot: request injection rate ($10^{-4}$ requests per cycle) versus #routers.]
communication cost among NoCMs will become dominant, and thus the system performance may decrease. Hence, once the local managers can already handle the requests in time, more partitions do not necessarily provide better performance.
3.8 Summary
This chapter introduces a dedicated connection allocator, the NoCM, which employs the trellis path search algorithm for the connection allocation of TDM CS. The results are summarized below:
• We proposed the TESSA algorithm, which can explore all possible paths between source-destination node pairs within a guaranteed latency.
• We proposed FB TESSA, which comprises two steps: forward search and path backtrack. Three different TESSA structures (unfolded, folded and bidirectional) are presented, to be chosen for different scenarios.
• In order to save path search time, RE TESSA is proposed, which merges the forward search and the path backtrack into a single step.
• The single-layer TESSA is proposed, which needs to implement only a single layer of the trellis. Hence, compared to the previous multi-layer approach, the resource consumption is reduced dramatically.
• In order to address the scalability problem of the centralized system, the partitioned structure is proposed, which divides the system into multiple partitions with multiple local NoCMs. Since each NoCM manages and communicates only with its local NoC nodes, the request load and communication overhead are reduced. As the NoCMs can work simultaneously, the computation capacity is enhanced.
Chapter 4
Centralized Connection Allocation
for Combined TDM-SDM CS NoCs
In this chapter, we present the trellis path search for the connection allocation of combined
TDM and SDM CS [CMF17c]. TDM shares the resource by splitting the link bandwidth
into time slots. However, there is a constraint on the time slot scheduling: after each hop,
the reserved slot along the path must be increased by 1, which can limit the probability
of successful connection allocation. To mitigate this, SDM has been proposed, which physically
splits the link wires into sub-channels, so that any free sub-channel at the next hop along
the path can be reserved. However, the area cost of an SDM switch scales quadratically with
the number of sub-channels, which limits the scalability. Moreover, the number of
sub-channels a link can be split into is limited by the number of link wires. In order to
address the problems of i) path diversity and ii) scalability, the combined TDM and SDM
CS was proposed, in which each sub-channel is further split into time slots. This increases
the path diversity and lets multiple connections share a sub-channel, which improves the
resource utilization. Hence, the poor resource usage inherent to CS is mitigated. In this
chapter, we propose a dedicated connection allocator for combined TDM-SDM CS NoCs
based on the trellis search algorithm, which can explore all possible paths between
source-destination node pairs within a guaranteed latency. All the trellis structures
presented in chapter 3 can be applied to combined TDM and SDM CS. Finally, we study the
influence of different TDM-SDM link partitioning strategies on success rate and path
length, which allows us to identify the most suitable partitioning.
4.1 Introduction of SDM CS
Due to advances in semiconductor technology, wires have become an abundant resource
in NoCs. However, using all the wires between two routers as a single wide communication
channel is inefficient and inadequate in many situations. Hence, we can adopt SDM to
physically split the wires into several sub-channels, which offers more flexibility
[LJL15, EJ13, LMV+08, MSAA09, YKH10, LJL14a]. In TDM CS, there is a tight scheduling
constraint on time slot reservation: if slot t is reserved in a router, then slot
(t + 1) mod S must be reserved at the next hop along the path, where S is the slot table
size. This limits the probability of path establishment in the network. In SDM, by
contrast, no matter which sub-channel is reserved in a router, any free sub-channel at
the next hop can be reserved along the path, which provides more path diversity, as
depicted in Fig. 4.1.

Figure 4.1: Connection allocation in TDM CS and SDM CS.
(a) In this TDM NoC, each link is split into 3 time slots. Along the path, if slot 0 is
reserved in R0, slot 1 must be reserved at the next hop.
(b) In this SDM NoC, each link is split into 3 sub-channels. Along the path, if
sub-channel 0 is reserved in R0, any free sub-channel can be reserved at the next hop.
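The difference between the two reservation rules can be illustrated with a small Python sketch. The function names and the representation of a link as a list of booleans (True = free) are assumptions made for illustration only; they are not taken from the allocator described in this thesis.

```python
def tdm_next_slot(t, S):
    """Slot that must be reserved at the next hop when slot t is used at this hop."""
    return (t + 1) % S


def tdm_hop_ok(free_slots, t):
    """TDM: the next hop is usable only if the single required slot is free."""
    S = len(free_slots)
    return free_slots[tdm_next_slot(t, S)]


def sdm_hop_ok(free_subchannels):
    """SDM: the next hop is usable if any sub-channel is free."""
    return any(free_subchannels)


# Next-hop link with 3 resources, of which only resource 2 is still free.
free = [False, False, True]
tdm_result = tdm_hop_ok(free, 0)   # slot 1 is required but occupied
sdm_result = sdm_hop_ok(free)      # sub-channel 2 can be taken
print(tdm_result, sdm_result)
```

With the same occupancy, the TDM hop is blocked while the SDM hop succeeds, which is the path diversity argument made above.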
4.1.1 Combined TDM and SDM CS
In SDM CS, as the number of sub-channels increases, the cost of the crossbar switch of a
router increases quadratically: if n is the number of sub-channels of a link, the cost of
the crossbar scales with O(n^2). The quadratic complexity can be reduced by using
a multiple-stage switch, but then the switch delay increases. Moreover, the number
of sub-channels is limited by the number of wires, so it cannot be increased arbitrarily.
As NoCs scale up in size, or as traffic flows increase, the number of required sub-channels
increases, which may induce unacceptable router complexity or simply be impossible because
of insufficient wires.
In TDM CS, if we have more flows and need a finer granularity for bandwidth allocation,
a larger slot table is required. Although increasing the slot table size does not increase
the area much, there is a direct relation between table size and maximum delay:
if the link is split into n slots, a packet has to wait n cycles for its next slot to appear
[GEEK11]. Moreover, the tight scheduling constraint on time slot reservation restricts
the path diversity.
Hence, the combined TDM and SDM CS is proposed. Combining TDM and SDM offers
more flexibility, since optimization can be made either in time or in space. As depicted
in Fig. 4.2, each sub-channel is split into multiple time slots so that it can be shared
by multiple connections, and thus the resource utilization of the sub-channel is improved.
Consequently, the number of sub-channels and time slots can be kept moderate, which
reduces the router complexity. In fact, TDM CS and SDM CS are special cases of
combined TDM-SDM CS: when the link is split into only 1 sub-channel, it becomes TDM CS;
when the link is split into only 1 time slot, it becomes SDM CS.

Figure 4.2: Combined TDM-SDM routers with each link split into 2 sub-channels and 3 time
slots. Along the path, if time slot 1 is reserved at a sub-channel in router R1, slot 2 at
any of the two sub-channels can be reserved in R2.
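As a minimal sketch of this idea, a link can be modelled as a table of time slots by sub-channels. The class name `Link` and its methods are hypothetical and only serve to show how the two special cases fall out of the same structure.

```python
class Link:
    """One directed link split into `slots` time slots x `subchannels` sub-channels."""

    def __init__(self, slots, subchannels):
        self.slots = slots
        self.subchannels = subchannels
        # free[t][c] is True while slot t of sub-channel c is unreserved.
        self.free = [[True] * subchannels for _ in range(slots)]

    def reserve(self, t):
        """Reserve any free sub-channel in slot t; return its index, or None if full."""
        for c, is_free in enumerate(self.free[t]):
            if is_free:
                self.free[t][c] = False
                return c
        return None

    def release(self, t, c):
        """Give slot t of sub-channel c back to the pool."""
        self.free[t][c] = True

    def capacity(self):
        """Number of connections the link can carry at the same time."""
        return self.slots * self.subchannels


tdm_only = Link(slots=16, subchannels=1)   # one sub-channel: plain TDM CS
sdm_only = Link(slots=1, subchannels=16)   # one time slot: plain SDM CS
combined = Link(slots=4, subchannels=4)    # combined TDM-SDM CS
```

All three links offer the same number of partitions; they differ only in how much freedom the allocator has within a given slot.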
4.1.2 Connection allocation in combined TDM and SDM CS
The connection allocation in combined TDM and SDM CS has to i) allocate a
contention-free path from the source node to the destination node and ii) allocate the
time slots and sub-channels along the path. As depicted in Fig. 4.2, the path is allocated
as R1 → R2, and the corresponding sequence of slots and sub-channels along the path is
{(slot 0, sub-channel 0), (slot 1, sub-channel 1), (slot 2, sub-channel 1)}.
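Step ii) can be sketched for an already chosen path as follows. This is an illustrative software model under stated assumptions: the function name, the nested-list free map, and the policy of taking the lowest-numbered free sub-channel are not specified in the text. The occupancy in the example is chosen so that the result reproduces the sequence quoted above; how the three entries map to physical hops is likewise an assumption.

```python
def allocate_along_path(links, start_slot):
    """links[i][t][c] is True if sub-channel c of hop i is free in slot t.

    Returns one (slot, sub-channel) pair per hop, or None if some hop is blocked.
    """
    S = len(links[0])
    plan = []
    for hop, free in enumerate(links):
        t = (start_slot + hop) % S  # TDM rule: the slot advances by one per hop
        # SDM rule: any free sub-channel in that slot may be taken.
        c = next((i for i, is_free in enumerate(free[t]) if is_free), None)
        if c is None:
            return None  # nothing has been reserved yet, so no rollback is needed
        plan.append((t, c))
    for free, (t, c) in zip(links, plan):  # commit only after every hop succeeded
        free[t][c] = False
    return plan


# Three hops, 3 time slots and 2 sub-channels per link, all free at first.
links = [[[True, True] for _ in range(3)] for _ in range(3)]
links[1][1][0] = False  # sub-channel 0 of hop 1 is busy in slot 1
links[2][2][0] = False  # sub-channel 0 of hop 2 is busy in slot 2
plan = allocate_along_path(links, start_slot=0)
print(plan)
```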
4.2 System Model
The system model of the dedicated allocator (i.e. NoCManager) based NoC architecture is
illustrated in Fig. 4.3, which is similar to the system model presented in chapter 3.
The NoCManager (NoCM) attempts to allocate the appropriate connections when it receives
connection requests.

Figure 4.3: System model of the NoCManager based NoC platform. Src: source node, Des:
destination node. (Labelled steps: 1. Connection request, 2. Allocation info,
3. Connection, 4. Connection release.)

Figure 4.4: Example NoC graph with nodes 0 to 3. Each link has two sub-channels. A node
can reach itself (curved arrow).

When a node needs a connection, it sends a connection reservation request to the
NoCM. The NoCM tries to allocate the connection when it receives the request. When the
allocation succeeds, the NoCM sends the resulting allocation information to the
corresponding source node. The source node then starts to transmit data along the allocated
connection based on the allocation information. When the data transfer is finished, the
source node deletes the established connection and informs the NoCM to free the
corresponding allocated sub-channels and time slots.
4.3 Connection Allocator Architecture
In this section, we present a dedicated hardware connection allocator (NoCM) to allocate
connections for TDM-SDM NoCs. It employs the trellis path search algorithm, which
guarantees the setup latency and explores all possible paths between two given nodes.
4.3.1 Formalizing The Trellis Graph Structure
The aforementioned path search problem can be efficiently solved by the trellis search
approach (TESSA). The example NoC in Fig. 4.4 can be represented by the trellis graph in
Fig. 4.5. The trellis graph has three important characteristics:
1. Stages: Each stage (i.e. column) of the trellis graph is a replica of the network,
as in Fig. 4.5. The trellis graph is constructed with multiple stages (by default, 2N−2
stages for an N×N mesh NoC, which equals the longest minimal path in the network)
to represent the multi-hop traversals through the network.
2. States: Each node in the trellis graph is called a state and represents a router in
the NoC.
3. State transitions: Each state transition corresponds to a forward step in the trellis
and is represented by a directed edge (or branch) connecting the two states. A
branch represents a link at a specific time slot.
In this section, we adopt the multi-layer approach, in which every slot at the initial
stage has its own layer of the trellis graph, so the search for each starting slot from
the source node can run in parallel.
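The three characteristics can be written down as a few helper functions. The sketch below is only a software illustration; the row-major node numbering and the function names are assumptions, chosen so that a 2×2 mesh matches the node labels of Fig. 4.4.

```python
def mesh_neighbors(n):
    """State transitions of an n x n mesh; every node can also reach itself."""
    adj = {}
    for r in range(n):
        for c in range(n):
            node = r * n + c
            nbrs = [node]  # self-transition (curved arrow in Fig. 4.4)
            if r > 0:
                nbrs.append(node - n)
            if r < n - 1:
                nbrs.append(node + n)
            if c > 0:
                nbrs.append(node - 1)
            if c < n - 1:
                nbrs.append(node + 1)
            adj[node] = nbrs
    return adj


def default_stages(n):
    """Default number of trellis stages: the longest minimal path, 2N - 2 hops."""
    return 2 * n - 2


def branch_slot(start_slot, stage, S):
    """Time slot represented by the branches that leave the given stage."""
    return (start_slot + stage) % S


adj = mesh_neighbors(2)
print(adj[0], default_stages(6), branch_slot(2, 1, 3))
```

In the multi-layer approach, one such trellis would exist per value of `start_slot`.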
4.3.2 Trellis Path Search Algorithm
In this section, the forward-backtrack (FB) trellis is adopted, which completes the path
search in two steps:
• Forward search, i.e. traverse the NoC to search for the best free path out of all
possible survivor paths.
• Backtracking, i.e. when the forward search succeeds, the backtrack collects
the survivor path, which is the contention-free path to be found.
All the trellis search approaches presented in chapter 3 can be applied to combined
TDM-SDM CS; we take the unidirectional FB trellis search and the bidirectional FB trellis
search as examples in this section.
4.3.2.1 Unidirectional Trellis Path Search
The unidirectional trellis graph of the example network in Fig. 4.4 is illustrated in
Fig. 4.5a. The source node at the initial stage sends out the search signal to try to
activate its connected neighbors, and the activated nodes continue to forward the search
until the last stage is reached. During the search, if any sub-channel of a branch is free,
the search signal can propagate along this branch. At the last stage, if the destination
node is active, the backtrack starts to collect the survivor path and the associated
sub-channels and time slots.

Figure 4.5: The example NoC can be represented by a trellis. (a) Unidirectional trellis
search graph. (b) Bidirectional trellis search graph. Assume each link has 2 sub-channels
(C1 and C2). Successive stages correspond to the time slot indices n mod S, (n+1) mod S
and (n+2) mod S. The green dotted arrow indicates that the search of this edge failed
because the edge is not available at the moment (already occupied). Src: source node,
Des: destination node.
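The forward search and backtrack described above can be modelled in software as follows. This is a sequential sketch of behaviour that the allocator performs in parallel hardware, and several details are assumptions: the dictionary-based free map, the choice of the lowest-numbered predecessor and the first free sub-channel, and the simplification that only the destination holds itself active once reached (so that every reported hop obeys the slot rule). The example occupancy is invented.

```python
def forward_backtrack(adj, free, src, dst, start_slot, S, stages):
    """adj[u]: neighbours of u. free[(u, v)][t][c]: True if sub-channel c of
    link u->v is free in slot t. Returns [(u, v, slot, sub-channel), ...] or None."""
    active = {src}
    pred = []  # pred[k][v]: survivor predecessor of v at stage k + 1
    for k in range(stages):
        t = (start_slot + k) % S
        chosen = {}
        if dst in active:
            chosen[dst] = dst  # a reached destination selects itself
        for u in sorted(active):
            if u == dst:
                continue
            for v in adj[u]:
                # A branch is valid if any of its sub-channels is free in slot t.
                if v not in chosen and any(free[(u, v)][t]):
                    chosen[v] = u  # exactly one survivor per state
        pred.append(chosen)
        active = set(chosen)
    if dst not in active:
        return None
    path, v = [], dst
    for k in range(stages - 1, -1, -1):  # backtrack from the last stage
        u = pred[k][v]
        if u != v:
            t = (start_slot + k) % S
            path.append((u, v, t, free[(u, v)][t].index(True)))
        v = u
    return path[::-1]


# 2x2 example network of Fig. 4.4 with 3 slots and 2 sub-channels per link.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
S, C = 3, 2
free = {(u, v): [[True] * C for _ in range(S)] for u in adj for v in adj[u]}
free[(0, 1)][0] = [False, False]  # link 0->1 fully occupied in slot 0
free[(2, 3)][1][0] = False        # first sub-channel of 2->3 busy in slot 1
path = forward_backtrack(adj, free, src=0, dst=3, start_slot=0, S=S, stages=2)
print(path)
```

The function only searches; reserving the returned slots and sub-channels in `free` is left out for brevity.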
Figure 4.6: Implementation schematic of the bidirectional trellis graph.
4.3.2.2 Bidirectional Trellis Path Search
In order to reduce the search time, we propose the bidirectional search, which launches
the search from both sides, the source node and the destination node, simultaneously. At
the middle stage of the trellis, it checks whether the searches from the two sides meet at
a node. If so, the backtrack starts from the middle stage to select the survivor path;
otherwise, the search fails. Hence, the search time is halved. The bidirectional trellis
graph of the example network in Fig. 4.4 is illustrated in Fig. 4.5b. The search begins at
the source (at the initial stage) and at the destination (at the last stage)
simultaneously, traversing the trellis to try to activate the connected neighbors. Source
and destination both activate node 2 at the middle stage, so the search succeeds. The
backtrack starts from node 2 at the middle stage towards the source and the destination
simultaneously. The survivor path is selected as N0 → N2 → N3.
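A meet-in-the-middle version of the search can be sketched in the same style. The code is a simplified illustration, not the hardware design: it assumes an even number of stages and symmetric links, omits self-transitions (so it only finds paths of exactly `stages` hops), picks the lowest-numbered meeting node, and returns only the node sequence. In hardware both halves run concurrently, whereas the sketch runs them one after the other.

```python
def bidirectional_search(adj, free, src, dst, start_slot, S, stages):
    """Return the survivor path as a list of nodes, or None if the searches miss."""
    mid = stages // 2
    fwd, fwd_pred = {src}, []
    for k in range(mid):  # source side: stage 0 up to the middle stage
        t = (start_slot + k) % S
        step = {}
        for u in sorted(fwd):
            for v in adj[u]:
                if v not in step and any(free[(u, v)][t]):
                    step[v] = u
        fwd_pred.append(step)
        fwd = set(step)
    bwd, bwd_succ = {dst}, []
    for k in range(stages - 1, mid - 1, -1):  # destination side: last stage down
        t = (start_slot + k) % S
        step = {}
        for v in sorted(bwd):
            for u in adj[v]:
                if u not in step and any(free[(u, v)][t]):
                    step[u] = v
        bwd_succ.append(step)
        bwd = set(step)
    meet = sorted(fwd & bwd)  # AND of the two search signals at the middle stage
    if not meet:
        return None
    nodes = [meet[0]]
    for step in reversed(fwd_pred):  # backtrack towards the source
        nodes.insert(0, step[nodes[0]])
    for step in reversed(bwd_succ):  # and towards the destination
        nodes.append(step[nodes[-1]])
    return nodes


adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
S, C = 3, 2
free = {(u, v): [[True] * C for _ in range(S)] for u in adj for v in adj[u]}
free[(0, 1)][0] = [False, False]  # invented occupancy: 0->1 is busy in slot 0
route = bidirectional_search(adj, free, src=0, dst=3, start_slot=0, S=S, stages=2)
print(route)
```

With this occupancy the two searches meet at node 2, which corresponds to the survivor path N0 → N2 → N3 of the example.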
4.3.3 Trellis Path Search Implementation
The implementation schematic of a bidirectional trellis graph is shown in Fig. 4.6. Each
state is implemented as a Detect-Select (DS) unit, detailed in Fig. 4.7, which evaluates
the incoming search signals to determine whether this state can be activated. If so, it
forwards the active search to the next stage. The actual branch state (the link allocation
state at a specific slot) is stored in the ‘Branch State’ register, with each element
representing a specific sub-channel. When a sub-channel is allocated at a specific slot,
its associated state register is set to ‘0’ to exclude it from later allocation.
Correspondingly, the state register is reset to ‘1’ to allow allocation again when the
sub-channel is released.
The workflow of the DS unit is the same for each slot and can be divided into two steps:
1. Detect the active incoming branches: the state registers of all sub-channels of a
branch are connected to an OR gate, so that the branch is valid as long as any
sub-channel is valid (free). The branch becomes active only when both the branch and
the corresponding incoming search are valid (‘AND’ operation). A node is active as
long as any incoming branch is active (‘OR’ operation).
2. Select one active branch as survivor path: one and only one of the active incoming
branches is selected as the ‘survivor path’ and saved in the ‘Survivor Path’ register.
At the same time, the active search is forwarded to the next stage. Note that once a
node is already active, it selects itself as the predecessor in later stages.

Figure 4.7: Implementation details of a DS unit with 2 sub-channels for a single slot of
node 0.
The path search is completed in two cycles. In the first cycle, the search is started at
the source (at the initial stage) and at the destination (at the last stage)
simultaneously and propagates until it reaches the middle stage. At the middle stage, the
corresponding search signals from the two sides are connected to an AND gate to check
whether the searches meet. In the next cycle, if the two searches meet, the backtrack
starts from the middle stage by reading out the stored predecessors hop by hop.
Each selected predecessor and one of its valid associated sub-channels at each stage are
saved into the ‘Survivor Path’ register, from which the complete path from the source
to the destination node can eventually be read.
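The combinational behaviour of one DS unit for one slot can be mimicked in a few lines of Python. This is a behavioural sketch, not the Verilog implementation; the function name and the rule of selecting the first active branch are assumptions, since the text only requires that exactly one active branch is selected.

```python
def ds_unit(incoming_search, branch_state):
    """incoming_search[i]: search signal arriving over incoming branch i.
    branch_state[i]: register bits of branch i, one per sub-channel
    (1 = free, 0 = allocated).
    Returns (node_active, index of the survivor branch or None)."""
    # Detect: OR over the sub-channels of each branch.
    branch_valid = [any(bits) for bits in branch_state]
    # Detect: AND of branch validity and incoming search signal.
    branch_active = [bool(s) and v for s, v in zip(incoming_search, branch_valid)]
    # Detect: OR over all incoming branches.
    node_active = any(branch_active)
    # Select: keep exactly one active branch as the survivor.
    survivor = branch_active.index(True) if node_active else None
    return node_active, survivor


# Node 0 with two incoming branches (from N1 and N2), 2 sub-channels each.
result = ds_unit([True, True], [[0, 0], [0, 1]])
print(result)
```

In the example both neighbours forward a search signal, but only the second branch still has a free sub-channel, so it becomes the survivor.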
4.4 Performance Evaluation
The synthesis and simulation results are presented in this section.
Figure 4.8: Area of (unidirectional) TESSA and Bidirectional TESSA NoCManagers with
different link partitioning for different NoC sizes. (Axes: #Routers versus area in
10^4 µm². Curves: 16slot-1subchannel and 4slot-4subchannel, each unidirectional and
bidirectional; 4slot-8subchannel and 8slot-4subchannel, bidirectional.)
4.4.1 Synthesis Results
The NoCManager is available in synthesizable Verilog HDL and can be generated from
an XML description for different NoC sizes. Using Synopsys Design Compiler, the NoCM
was synthesized in TSMC 65 nm technology for mesh networks of size 4x4
to 10x10. For both (unidirectional) TESSA and bidirectional TESSA, the critical path
constraint was gradually increased with the NoC size. For instance,
in bidirectional TESSA, the critical path is constrained to 1.11 ns for a 6x6 mesh and is
increased to 2 ns for a 10x10 mesh. The link is split into different partitions (different
numbers of time slots and sub-channels), e.g. 16slot-1subchannel, 4slot-4subchannel,
4slot-8subchannel, etc.
As shown in Fig. 4.8, the area of the TESSA and bidirectional TESSA NoCManagers grows
with O(S · M · C · √M) in a 2D mesh (M = #routers, S = #slots, C = #sub-channels),
where √M is related to the number of trellis stages, 2 · (√M − 1). The area grows
rapidly with the slot table size, but quite slowly with the number of sub-channels. The
area of bidirectional TESSA is almost the same as that of (unidirectional) TESSA. As shown
in Fig. 4.9, the average AT complexity of bidirectional TESSA is almost 2X better than
that of (unidirectional) TESSA. As shown in Fig. 4.10, the average energy consumption per
allocation grows with O(S · M · √M). The influence of the number of sub-channels on
energy is almost negligible. In conclusion, bidirectional TESSA halves the critical
path and therefore halves the path search time compared to the (unidirectional) trellis
search algorithm, while consuming almost the same area and energy.
Figure 4.9: Average AT complexity per allocation of two different NoCManagers with
different link partitioning for different NoC sizes. (Axes: #Routers versus AT per
allocation in 10^4 µm²/GHz.)

Figure 4.10: Average energy consumption per allocation of two different NoCManagers with
different link partitioning for different NoC sizes. (Axes: #Routers versus energy per
allocation in picojoules.)
4.4.2 Simulation Results
We evaluate the inﬂuence of diﬀerent link partitioning strategies on success rate and
average ﬂit delivery latency. These results are explained in this section.
For the evaluation, several performance metrics are used:
• Success rate denotes the ratio of requests that successfully established a path to the
total number of requests.
• GS offered load refers to the ratio of the GS traffic offered by each master to its
maximum capacity. Suppose the link width is 256 bits, a new request is generated by each
master every 2000 cycles, and each connection delivers 128,000 bits of data
after setup; then the offered load is 128000 ÷ 256 ÷ 2000 = 0.25.
• Background traffic load refers to the ratio of the link capacity (time slots and
sub-channels) of each router that is already randomly occupied, and thus excluded from
the path search, to the total link capacity of each router.
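The offered-load example can be checked with a short calculation. The function and parameter names below are illustrative; only the numbers come from the definition above.

```python
def gs_offered_load(bits_per_connection, link_width_bits, request_interval_cycles):
    """Fraction of a master's link capacity that its GS traffic asks for."""
    cycles_needed = bits_per_connection / link_width_bits  # cycles at full link width
    return cycles_needed / request_interval_cycles


load = gs_offered_load(128_000, 256, 2000)
print(load)
```

Each connection keeps the full link busy for 500 cycles out of every 2000, which gives the offered load of 0.25 stated above.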
In this section, all simulations use the same link width of 256 wires (bits) per link in a
6x6 mesh network.
4.4.2.1 Inﬂuence of diﬀerent link partitioning on success rate
The nodes issue connection requests to the NoCM following uniform random traffic that
meets the given GS offered load. During simulation, half of the nodes are assumed
to be masters and the other half slaves. The source-destination pairs are selected
uniformly at random, so that each pair has equal probability of being chosen.
Every data point shown in the figures comes from a simulation of 1 million cycles. The
link width is set to 256 bits, so each sub-channel is 128 bits and 16 bits wide in the
2-subchannel and 16-subchannel partitioning, respectively (see footnote 7). After a
connection is established, 128,000 bits of data are delivered before the connection is
released.
When the link is split into a fixed number of partitions (#time slots · #subchannels),
splitting the link into a greater number of sub-channels provides more path diversity,
which can contribute to a higher success rate. As depicted in Fig. 4.11, the
2slot-8subchannel partitioning offers up to 4% higher success rate than the
16slot-1subchannel partitioning (at GS offered load 0.1). However, beyond a certain number
of sub-channels, more sub-channels do not necessarily contribute to a higher success rate:
the success rate of 1slot-16subchannel is even a little lower than that of
2slot-8subchannel. The reason is that the network reaches its saturation point. It cannot
allocate more connections simply because it has already reached the limit of the NoC
capacity. When the link is split into 16 partitions, it can support at most 16 concurrent
connections no matter how it is split. Since the 1slot-16subchannel partitioning offers
more path diversity, it can successfully allocate some long paths, which occupy a lot of
link capacity, and thus the network reaches saturation early. Alternatively, if we
increase the number of partitions of the link, it provides more path diversity as well as
more NoC capacity. Compared to the 1slot-16subchannel partitioning, both 4slot-8subchannel
and 8slot-4subchannel offer a higher success rate. Since more sub-channels lead to higher
router complexity, it can be a better choice to split the sub-channels into time slots
instead of increasing the number of sub-channels (e.g. 2slot-8subchannel instead of
1slot-16subchannel), which provides a good success rate at relatively low router
complexity.

Footnote 7: We assume that only payload data is delivered, without header overhead,
because the packet header can be omitted in CS.

Figure 4.11: Influence of different link partitioning on success rate in a 6x6 mesh.
(Axes: GS offered load versus success rate.)

Figure 4.12: Influence of different link partitioning on success rate under background
traffic. (Axes: background traffic load versus success rate.)
4.4.2.2 Evaluation of diﬀerent link partitioning under certain background
traﬃc
In this section, we run the simulation under pre-defined background traffic to show that
increasing the number of sub-channels can increase the path diversity significantly.
Moreover, the number of trellis stages is set to 16 (the default is 10 stages for a 6x6
mesh) to allow the exploration of non-minimal paths. We first generate an amount of random
background traffic that occupies some of the link capacity. Then we measure the path
length and success rate of path allocations between all pairs of nodes. We produce 300
samples at each background traffic load.
Influence of different link partitioning on success rate. From the simulation results
of success rate in Fig. 4.12, we can see that the 1slot-16subchannel partitioning
offers up to 3X higher success rate than the 16slot-1subchannel partitioning (at
background traffic load 0.95). However, the success rate of 2slot-8subchannel is almost
the same as that of 1slot-16subchannel. Alternatively, we can split each sub-channel into
more time slots (i.e. increase the number of link partitions) to achieve a higher success
rate. The 4slot-8subchannel partitioning provides up to 2.4X higher success rate than
2slot-8subchannel and 1slot-16subchannel.
Influence of different link partitioning on delivery latency and path length.
Some messages, e.g. control messages, may be small but have strict requirements on the
delivery latency. So in this section, we present the comparison results
of delivery latency, i.e. the latency for delivering a single flit. For each flit, the
delivery latency in TDM-SDM CS comprises two parts: 1) the waiting time in the TDM scheme
until its time slot arrives, which equals the slot table size; and 2) the routing time to
traverse the NoC, which equals the path length. For example, assume a connection in
8slot-2subchannel CS has a path length of 7 hops; then the delivery latency per flit is 15
cycles in total (= 8 + 7).
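The latency model used in this comparison is simple enough to state as code. The function name is illustrative, and holding the path length at 7 hops for every partitioning is a simplification made only for this example (the simulations show that the path length itself depends on the partitioning).

```python
def flit_delivery_latency(slot_table_size, path_hops):
    """Per-flit latency model: slot-table wait plus one cycle per hop."""
    return slot_table_size + path_hops


example = flit_delivery_latency(8, 7)  # the 8slot-2subchannel example
# Same 7-hop path under different slot table sizes.
same_path = {s: flit_delivery_latency(s, 7) for s in (16, 8, 4, 2, 1)}
print(example, same_path)
```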
The average path length per connection and the average delivery latency per flit are shown
in Fig. 4.13 and Fig. 4.14, respectively. As we can see, when the background traffic load
is heavy (more than 0.94), the higher the background traffic load is, the shorter the path
is. However, when the background traffic load is low (lower than 0.9), the lower the
background traffic load is, the shorter the path is. The reason is that, when the
background traffic load is very high, the source node can only reach nearby nodes. As the
background traffic is reduced, the source node can reach more and more nodes, and thereby
the average path becomes longer. However, beyond a certain point, the source node can
reach all nodes in the network but may have to take many detours. So if we keep reducing
the background traffic, the number of detours becomes smaller, and thereby the path
becomes shorter.
When the link is split into more sub-channels, or the sub-channels are split into more
slots, the path diversity is enhanced, which may reduce detours and thereby shorten the
path length. As Fig. 4.13 shows, the path in the 1slot-16subchannel partitioning is the
shortest; it can be reduced by half compared to the 16slot-1subchannel partitioning (at
background traffic load 0.87). When the background traffic is low (lower than 0.93), the
path length of the 8slot-4subchannel partitioning is shorter than that of
4slot-4subchannel. If the link is split into more time slots, a message has to wait longer
for its slot to appear, which induces a longer delivery latency per flit. As Fig. 4.14
shows, the delivery latency in the 1slot-16subchannel partitioning can be reduced by a
factor of 6 compared to the 16slot-1subchannel partitioning (at background traffic load
0.87).

Figure 4.13: Influence of different link partitioning on average path length under
background traffic. (Axes: background traffic load versus path length in hops.)

Figure 4.14: Influence of different link partitioning on average delivery latency under
background traffic. (Axes: background traffic load versus flit latency in cycles.)
4.5 Summary
This chapter presents a dedicated connection allocator, NoCM, which employs the trellis
path search algorithm for the connection allocation of combined TDM-SDM CS. The
results are summarized below:
• In order to mitigate the poor resource utilization of CS, the combined TDM-SDM CS is
proposed, which increases the path diversity and lets multiple connections share a
sub-channel.
• For the connection allocation of combined TDM-SDM CS, we proposed the TESSA-based
dedicated allocator.
• The unidirectional and bidirectional TESSA are presented and compared in this
chapter. Since the bidirectional approach searches from both sides, the source node
(at the initial stage) and the destination node (at the last stage), simultaneously, the
search time is halved compared to the unidirectional approach, while the area and energy
consumption are almost the same.
• Finally, in order to investigate and optimize the TDM-SDM partitioning strategy, we
studied the influence of different TDM-SDM link partitioning strategies on success
rate and path length.
Chapter 5
Router Design
This chapter presents the router architecture, which combines a circuit-switching network
and a packet-switching network in order to handle the GS and BE traffic efficiently and
separately. GS messages are transmitted over the circuit-switching network along the
pre-reserved connection. The packet-switching network, on the other hand, uses the
unreserved resources to transmit BE messages, which increases the resource utilization.
Furthermore, when the allocation of a requested connection fails, the corresponding
GS message can be transmitted over the packet-switching network, which provides a
fallback for unsuccessful GS traffic. Part of the results presented here have
been previously published in [CMF17b].
5.1 System Model
The system model of a dedicated allocator (i.e. NoCManager) based NoC architecture
is illustrated in Fig. 5.1. The NoCManager (NoCM) attempts to allocate the appropriate
connections when it receives connection requests.

Figure 5.1: System model of the NoCManager based NoC platform. (Labelled steps:
1. Connection request, 2. Allocation info, 3. Connection, 4. Connection release.)

Typically, there are two kinds of routing mechanisms for GS communication: source routing
and distributed routing. In source routing, all routing decisions (the path information)
for a packet are made entirely in the source terminal by table lookup of a precomputed
route. The path information is embedded in the packet header by the source node and
indicates, for each hop, which output port to take (3 bits for the five directions in a
2-D mesh network: east, west, north, south and local). In contrast, in distributed
routing, the routing information is stored in the routing nodes along the path rather
than in the source terminals. The routing decision is thus made at each hop, and each node
needs to hold only its own routing decision rather than the entire path. The
advantage of source routing is that the data transmission can start as soon as the source
node has received the allocation information from the NoCM. In distributed routing,
however, the source node first has to set up the connection by configuring the routers
along the path hop by hop before starting data transmission. Hence, in order to save
connection setup time, source routing is employed in our system for GS communication.
With source routing, the packets are inserted into the network only at specific time
slots, which is regulated by the source terminal, the same as in Aelite [HG10].
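The per-hop path information of source routing can be illustrated as follows. The 3 bits per hop and the five directions come from the text; the concrete port codes, the bit order (first hop in the lowest bits) and the function names are assumptions made for this sketch.

```python
# Assumed 3-bit codes for the five output ports of a 2-D mesh router.
PORTS = {"east": 0, "west": 1, "north": 2, "south": 3, "local": 4}
NAMES = {code: name for name, code in PORTS.items()}
BITS_PER_HOP = 3


def encode_route(hops):
    """Pack the per-hop output ports into one integer, first hop in the lowest bits."""
    header = 0
    for i, port in enumerate(hops):
        header |= PORTS[port] << (BITS_PER_HOP * i)
    return header


def next_port(header):
    """What each router does: read its own 3 bits and shift the rest down."""
    return NAMES[header & 0b111], header >> BITS_PER_HOP


route = ["east", "east", "north", "local"]
header = encode_route(route)

decoded, remaining = [], header
for _ in route:
    port, remaining = next_port(remaining)
    decoded.append(port)
print(header, decoded)
```

Because the whole route travels with the packet, no router along the path has to be configured before transmission starts.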
In our system, when a source node needs a connection, it sends a connection reservation
request to the NoCM. The NoCM tries to allocate the connection when it receives the
request. When the allocation succeeds, the NoCM sends the resulting allocation information
to the corresponding source node. After receiving the allocation information, the source
node starts to transmit GS data along the connection. When the data transfer is finished,
the source node deletes the allocation information and sends a BE packet to inform the
NoCM that the connection is released. Since the NoC does not drop any packet, this BE
packet will reach the NoCM eventually.
There are three possible schemes for the communication between the NoCM and the NoC nodes:
a dedicated configuration network, dedicated wires, or the existing NoC. In order to
achieve a high allocation speed, in this thesis the connection request from the source
node is sent to the NoCM via dedicated wires, while the allocation information from the
NoCM to the source node is delivered over the existing NoC as a GS packet. If
the allocation of the GS path from the NoCM to the source node fails, the allocation
information is sent as a BE packet. In a mesh network, the source node needs log2 M wires
(bits) to deliver the connection request and indicate which node is the destination, where
M is the number of nodes in the NoC. Thanks to the partitioned architecture, which divides
a large system into multiple smaller logical partitions with multiple local managers
(explained in section 3.7), each NoCM only manages and connects to a limited number of
nodes in its local region, so the dedicated NoCM-node wires do not add much overhead.
Figure 5.2: The proposed router architecture. (Block labels: Circuit-switched,
Packet-switched, Path info, GS header, GS payload, BE queue, HPU, Crossbar Switch.)
5.2 Proposed Router Architecture
5.2.1 Router architecture
The proposed router architecture consists of two modules, as illustrated in Fig. 5.2: a
circuit-switching part for transferring GS packets and a packet-switching part for
transferring BE packets. The circuit switching and the packet switching share the switch
and the links. In our system, a packet is divided into small flits. In the flit header,
there is a bit to indicate whether the flit is BE or GS. GS flits are prioritized over BE
flits: a BE flit is forwarded only if the output port is not reserved by GS at the moment.
Each GS flit has two phits.
The router has five bidirectional ports. Four of them connect to the four adjacent
routers, and the fifth connects to the local module. The routing is distributed, such that
up to five packets can be routed simultaneously when they request different output ports.
Since a BE flit may be stalled in the router if the desired output port is not available,
there is a BE queue to buffer the BE data. A GS flit, however, is delivered along a
pre-scheduled contention-free path, so it is never stalled and needs no buffering. The
router has a 2-stage pipeline, corresponding to a flit size of two phits, and hence a time
slot has two cycles.
Figure 5.3: Block diagram of the proposed combined BE-GS router with 4 ports (north, east,
south, west). Each input port shows the header bit of the arriving flit (BE or GS) and each
output port shows its GS reserve register (an input port name or Null); BE flits pass
through route computation, the BE routing switch queue and the outport arbiter. As an
example, a GS connection (green arrow) from port south to north is established and the GS
flits from south to north are directly forwarded.
Figure 5.4: GS flits. The first flit in a group carries the header, and consecutive flits
may skip it. Legend: unused slot, used slot with header phit, used slot with no header
phit, End of Packet. For a flit that carries a header, the header phit overhead is 1/2.
5.2.2 Packet-switching Part
The packet-switching part consists of input queues and Header Parsing Units (HPUs), as
shown in Fig. 5.2. The input queues store the incoming BE packets. The HPUs perform the
routing computation to determine the output port to which the fetched packet should be
forwarded, depending on the target address in the packet header. The fetched packet is
then stored in a register until the switch signals that the corresponding output port is
available. BE packets are forwarded only if the desired output port is free. Different
routing algorithms (deterministic or adaptive) are available and can be used in a specific
NoC realization. However, since the packet-switching network in our system only delivers
non-critical BE packets, it is kept as simple as possible. Hence, simple deterministic x-y
routing is adopted for BE packets, with a stall/go protocol for flow control, similar to
[WF11, MF16]. Round-robin arbitration is used when multiple BE packets request the same
output port in the same clock cycle.
Table 5.1: The control signals of GS phit
Signals  Usage
00       Idle
01       Header
10       Payload
11       Tail, payload with EoP
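The two BE forwarding decisions described in this subsection, deterministic x-y routing and round-robin arbitration, can be sketched as follows. This is a behavioural illustration only, not the synthesized design: the coordinate convention (y growing northwards), the port names and the class and function names are assumptions made for the example.

```python
def xy_route(cur, dst):
    """Deterministic x-y routing: resolve the x offset first, then y.

    cur and dst are (x, y) mesh coordinates; y is assumed to grow northwards.
    """
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "east" if dx > cx else "west"
    if dy != cy:
        return "north" if dy > cy else "south"
    return "local"


class RoundRobinArbiter:
    """Grants one requester per call and rotates priority after each grant."""

    def __init__(self, requesters):
        self.requesters = list(requesters)
        self.next_idx = 0  # requester with the highest priority in the next round

    def grant(self, requesting):
        """Return the granted requester, or None if nobody is requesting."""
        n = len(self.requesters)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            if self.requesters[idx] in requesting:
                self.next_idx = (idx + 1) % n
                return self.requesters[idx]
        return None


if __name__ == "__main__":
    print(xy_route((1, 1), (3, 0)))
    arbiter = RoundRobinArbiter(["north", "east", "south", "west", "local"])
    print(arbiter.grant({"east", "west"}), arbiter.grant({"east", "west"}))
```

Rotating the starting index after every grant is what prevents one input port from starving the others when several BE packets keep requesting the same output.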
5.2.3 Circuit-switching Part
GS is based on the reservation of a connection between two given nodes, which is scheduled
by the NoCM and has no routing restriction. In the routers along the connection, the
corresponding input and output ports are reserved for the GS packet. There is no arbiter
for GS packets because contention is avoided by the NoCM schedule. The router has no notion
of the TDM scheme and blindly forwards the data on the input ports to the correct output
ports. Fig. 5.3 shows an example of a combined BE-GS router with 4 ports (north, east,
south and west). This router is part of a reserved GS connection that crosses it from south
to north. The connection is reserved by setting the reserve registers in the routers' ports,
which explicitly indicate whether the corresponding output port is part of a connection and
which input port should be connected to it (here, the reserve register of the north outport
is set to the south inport). Owing to the reservation, a GS flit is always forwarded as soon
as it arrives at a router, independent of possible congestion, which affects only BE traffic.
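A minimal behavioural sketch of the reserve-register mechanism, mirroring the south-to-north example of Fig. 5.3, is given below. The class name, the dictionary-based flit representation and the fixed scan order for BE flits are assumptions made for brevity; in particular, the BE side of the actual router uses round-robin arbitration rather than a fixed scan.

```python
from typing import Dict, Optional, Tuple

PORTS = ("north", "east", "south", "west")


class OutputPort:
    """One output port with a GS reserve register (None stands for Null)."""

    def __init__(self) -> None:
        self.gs_reserve: Optional[str] = None

    def select(self, flits_in: Dict[str, Tuple[str, bool]]) -> Optional[str]:
        """Return the input port whose flit is forwarded in this slot, if any.

        flits_in maps an input port name to (kind, wants_this_output),
        where kind is "GS" or "BE".
        """
        if self.gs_reserve is not None:
            # Reserved: only GS data from the registered input may pass.
            kind, _ = flits_in.get(self.gs_reserve, ("", False))
            return self.gs_reserve if kind == "GS" else None
        # Not reserved: a BE flit requesting this output may be forwarded.
        for port in PORTS:
            kind, wants = flits_in.get(port, ("", False))
            if kind == "BE" and wants:
                return port
        return None


if __name__ == "__main__":
    north = OutputPort()
    north.gs_reserve = "south"  # connection crosses the router from south to north
    print(north.select({"south": ("GS", True), "west": ("BE", True)}))
```

Note that the GS branch involves no arbitration at all, which reflects the statement that contention is resolved in advance by the NoCM schedule.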
Each GS flit has two phits, each of which is either i) a header or ii) a payload (plain
payload or payload with End-of-Packet). A 2-bit control signal indicates the type of each
phit, as listed in Table 5.1. The control signal indicates the validity of the phit and is
used to maintain or release the reserved connection. The value ‘00’ indicates that the phit
is not valid. ‘01’ indicates a header phit that contains the path information to the
destination node. ‘10’ indicates a payload phit, meaning that the transfer of GS traffic is
ongoing. ‘11’ indicates a payload phit that also carries the End-of-Packet (EoP) flag,
which releases the connection.
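The control encoding of Table 5.1 and its effect on a reservation can be written down compactly. The sketch below is illustrative: the names are hypothetical, and the behaviour for an idle phit (leaving the reservation unchanged) is an assumption, since the text only specifies that a header sets up and a tail releases the connection.

```python
from enum import IntEnum


class PhitCtrl(IntEnum):
    """2-bit control signal of a GS phit, values as in Table 5.1."""
    IDLE = 0b00
    HEADER = 0b01
    PAYLOAD = 0b10
    TAIL = 0b11  # payload with End-of-Packet


def reservation_after(ctrl: PhitCtrl, reserved: bool) -> bool:
    """Return whether the output port is still reserved after this phit."""
    if ctrl == PhitCtrl.HEADER:
        return True   # header sets up the reservation
    if ctrl == PhitCtrl.TAIL:
        return False  # EoP releases it
    return reserved   # payload keeps it; idle assumed to leave it unchanged


if __name__ == "__main__":
    state = False
    for ctrl in (PhitCtrl.HEADER, PhitCtrl.PAYLOAD, PhitCtrl.PAYLOAD, PhitCtrl.TAIL):
        state = reservation_after(ctrl, state)
        print(ctrl.name, state)
```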
Table 5.2: Resource consumption of the proposed router with 65 nm technology
Data width (bits)  Area (µm²)  Power (mW)  Critical path (ns)
16                 10875       2.2         0.8
32                 18085       3.2         0.8
48                 24784       4.2         0.9
64                 32801       4.8         0.9
80                 40326       6.2         1
In a flit, the first phit is the header, which sets up the connection (i.e. configures the
corresponding outport GS reserve registers) during routing. The next phit carries the
payload data and blindly follows the path of the first phit. Since the path is the same for
all flits belonging to a specific connection, when multiple consecutive flits belong to the
same connection, only the first flit contains the header and all the other flits contain
only payload data, as shown in Fig. 5.4. The connection (selected output port) remains the
same until the last phit (tail) arrives, whose EoP flag releases the connection. Within the
header phit, the path field holds a sequence of target output ports that encodes the path
through the network. At each hop, the router reads the lowest 3 bits of the path and then
shifts those bits away.
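The per-hop handling of the path field can be illustrated with a few lines of bit manipulation. The 3-bit width per hop is taken from the text; the numeric port codes below are hypothetical, since the actual encoding is not given here.

```python
PORT_BITS = 3  # the text states that each hop consumes 3 bits

# Hypothetical port numbering, chosen only for this example.
PORT_CODE = {"north": 0, "east": 1, "south": 2, "west": 3, "local": 4}
PORT_NAME = {code: name for name, code in PORT_CODE.items()}


def encode_path(ports):
    """Pack a list of output ports into a path field, first hop in the lowest bits."""
    path = 0
    for hop, port in enumerate(ports):
        path |= PORT_CODE[port] << (hop * PORT_BITS)
    return path


def next_hop(path):
    """Read the lowest 3 bits and shift them away, as each router does."""
    port = PORT_NAME[path & ((1 << PORT_BITS) - 1)]
    return port, path >> PORT_BITS


if __name__ == "__main__":
    path = encode_path(["east", "east", "north", "local"])
    while path:
        port, path = next_hop(path)
        print(port)
```

Since every router only inspects and discards the lowest bits, no router needs a routing table or any knowledge of the TDM schedule.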
5.3 Synthesis Results
The proposed router architecture was synthesized with Synopsys Design Compiler in TSMC
65 nm technology. The size of the BE queue is set to 4 words. Since source routing is
employed, the router has no notion of the TDM scheme, and thus splitting the link into
different time slots does not affect the router resource consumption. This section shows
the synthesis results for different link data widths, as listed in Table 5.2. The data
width does not include the first 3 bits, i.e. 1 bit indicating a BE or GS flit and 2 bits
for the control signals. For example, if the data width is 16 bits, the total link width is
19 bits. The synthesis results show that the router area increases approximately linearly
with the data width: every additional 16 bits increases the area by roughly 6700 to
8000 µm² (about 7400 µm² on average).
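The link-width arithmetic and the area growth per 16-bit step can be checked directly against Table 5.2. The short script below only re-uses the numbers from the table; the variable and function names are chosen for illustration.

```python
# Data width (bits) -> area (um^2), copied from Table 5.2.
AREA_UM2 = {16: 10875, 32: 18085, 48: 24784, 64: 32801, 80: 40326}

OVERHEAD_BITS = 1 + 2  # BE/GS indicator bit plus two control bits


def link_width(data_bits: int) -> int:
    """Total link width for a given data width."""
    return data_bits + OVERHEAD_BITS


# Area increase for each additional 16 data bits.
widths = sorted(AREA_UM2)
steps = [AREA_UM2[b] - AREA_UM2[a] for a, b in zip(widths, widths[1:])]
mean_step = sum(steps) / len(steps)

if __name__ == "__main__":
    print("link width for 16 data bits:", link_width(16))
    print("area steps (um^2):", steps)
    print("mean step (um^2):", mean_step)
```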
5.4 Summary
This chapter presented a router architecture for GS and BE traffic. The main points are
summarized below.
• The router consists of two parts: a circuit-switched part for transferring GS packets
and a packet-switched part for transferring BE packets. The packet-switched network
uses the unreserved resources to increase resource utilization, and can serve as a
fallback for GS packets whose connection allocation failed.
• The source node sends the connection request to the central manager via dedicated
wires, and the manager sends back the response message as a GS packet. Owing to the
partitioned architecture, in which each manager manages only its local nodes, the
dedicated wires do not add much overhead. Since the dedicated configuration network
that is widely used in previous works is avoided, the hardware cost is reduced, while
the setup latency is still guaranteed.
• Source routing is adopted in our design to save connection setup time and to keep the
router simple, since the router has no notion of the complicated TDM scheme and just
blindly forwards the GS data. Moreover, it allows an ordinary router to support TDM CS
without much modification.
Chapter 6
Conclusions and Future Work
This work concentrated on connection allocation for circuit-switching networks. In this
thesis, we proposed a high-performance dedicated centralized connection allocator, the
NoCManager, to address the dynamic connection allocation problem. The NoCManager solves the
connection allocation problem with a trellis-search algorithm, which can explore all
possible paths between source-destination node pairs within a guaranteed latency. We
summarize the main contributions of the dissertation as follows.
1. The path search problem is solved step by step, in the manner of dynamic programming, to
reduce computational complexity as well as to ensure path optimality (shortest path).
2. We search all slots in all directions simultaneously along multiple paths, which
completes the path search within a guaranteed latency and enhances the allocation
success probability.
3. The hardware architecture of the NoCM is efficient: the critical path of each stage
is only an OR gate and an AND gate.
4. Different trellis structures (unfolded, folded and bidirectional) are presented. The
unfolded trellis achieves a high allocation speed, while the folded trellis is more
area efficient. The bidirectional trellis doubles the allocation speed compared to the
unidirectional trellis. Different trellis structures can be adopted in different
scenarios according to specific system requirements.
5. The Register-Exchange technique is adopted, which omits the backtrack step. Hence,
compared to forward-backtrack approaches, where a backward phase is required to build
the path after the forward search, the allocation time is halved.
6. To reduce the resource consumption, we proposed the single-layer TESSA, which is much
more area and energy efficient than the previous multi-layer approach.
7. To address the scalability problem of centralized systems, we proposed the partitioned
architecture, which divides the original system into multiple smaller partitions served
by multiple local managers. Since the managers can work simultaneously, the computation
capacity is enhanced. As the NoC nodes communicate only with their local managers, the
communication overhead is mitigated.
8. In general, circuit-switching NoCs suffer from limited path diversity and low resource
utilization. To mitigate this problem, TDM, SDM and combined TDM-SDM resource sharing
mechanisms have been proposed to extend circuit switching. In this thesis, we proposed
connection allocation approaches for all of these extended circuit-switching mechanisms.
Moreover, to investigate and optimize the TDM-SDM partitioning strategy, we studied the
influence of different link partitioning strategies with the same total wire resources
on the success rate and path length of TDM-SDM CS.
Future Research Directions
How to partition the system: The partitioned architecture can enhance the scalability
of centralized systems by dividing the original system into multiple smaller partitions
with multiple local managers. However, how the system should be partitioned to achieve the
best performance, and into how many partitions, remains open. This thesis provides a
preliminary suggestion, and we will study the problem in more depth in the future to find
the best partitioning strategies.
Important request served first: In the NoCM, the incoming connection requests are
stored in a FIFO, and the request that arrives first is served first. However, some
requests may be latency sensitive and need to be served first. In the future, we will
assign different priorities to the requests, so that high-priority requests are served
first.
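One possible way to realize such a policy in software is a priority queue that falls back to arrival order among requests of equal priority. The sketch below is only an illustration of the idea, not a proposed hardware design; the class name, the integer priority levels and the (source, destination) request format are all assumptions.

```python
import heapq
import itertools


class RequestQueue:
    """Serves the highest-priority request first; ties keep arrival (FIFO) order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing arrival stamp

    def push(self, src, dst, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), src, dst))

    def pop(self):
        _, _, src, dst = heapq.heappop(self._heap)
        return src, dst

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = RequestQueue()
    queue.push(0, 5)
    queue.push(1, 7, priority=2)  # latency-sensitive request
    queue.push(2, 3)
    while queue:
        print(queue.pop())
```

With all priorities left at the default, the queue degenerates to the plain FIFO behaviour described above.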
Weight the link to balance the traffic load: In the network, some links are more
critical than others. We can therefore assign different weights to different links, with
critical links receiving higher weights. During the search, the weight along the path is
accumulated. At each state, the incoming path with the lowest accumulated weight is
selected as the survivor path. For instance, we can assign a low weight to a link with more
free time slots and a high weight to a link with fewer free time slots, and thus balance
the traffic load.
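The survivor selection outlined in this paragraph can be sketched as follows. The linear weight rule (total slots minus free slots), the link labels and the function names are hypothetical choices for the example; the text only requires that links with fewer free slots receive a higher weight.

```python
def link_weight(free_slots: int, total_slots: int) -> int:
    """Fewer free slots give a higher weight (hypothetical linear rule)."""
    return total_slots - free_slots


def extend(entry, link, free_slots, total_slots):
    """Extend a path over one more link, accumulating the link weight."""
    weight, path = entry
    return weight + link_weight(free_slots, total_slots), path + [link]


def select_survivor(incoming):
    """Pick the incoming (accumulated_weight, path) pair with the lowest weight.

    Ties are resolved in favour of the first candidate.
    """
    return min(incoming, key=lambda entry: entry[0])


if __name__ == "__main__":
    start = (0, [])
    via_b = extend(start, "A-B", free_slots=6, total_slots=8)  # lightly loaded link
    via_c = extend(start, "A-C", free_slots=1, total_slots=8)  # heavily loaded link
    print(select_survivor([via_c, via_b]))
```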
Efficient communication mechanism between routers and NoCMs: The communication between
routers and NoCMs is important in a centralized system, and the communication time may
sometimes even exceed the path search time. In this thesis, the routers send the requests
to the NoCM via dedicated wires, and the NoCM sends the response message back to the routers
as GS packets. The dedicated wires are not scalable (though the partitioned architecture
can mitigate the scalability issue), and the GS packet requires the associated GS path to
be allocated. In the future, we will try to find a more efficient communication mechanism
between routers and NoCMs that minimizes the resource consumption and communication time.
Hierarchical NoC: In this thesis, we apply the trellis path search algorithm to the flat
2-D mesh. However, according to [WPG10, PNPR07], a clustered and hierarchical NoC may
sometimes provide better performance by shortening the distance between two modules and
adding more bandwidth. In a clustered and hierarchical NoC, each cluster can be a 2D-mesh
NoC on its own, and the clusters themselves are connected with each other in a 2D-mesh
topology. Hence, a hierarchical NoC can be adopted for large NoCs. In this case, each
cluster can be managed by one manager, and the managers are connected with each other in a
2D-mesh topology, which may mitigate the scalability issue.
Folded bidirectional structure: In this thesis, the bidirectional TESSA is implemented
as an unfolded structure. However, it can also be implemented as a folded structure, in
which each of the two sides is folded. If, in a given cycle, the searches from the two
sides reach the middle stage but do not meet at any node, then in the next cycle the search
from the source node goes back to the initial stage, the search from the destination node
goes to the last stage, and both continue. The search continues until the searches from
the two sides meet at the middle stage or the cycle limit is exceeded. Half-folded or
quarter-folded structures can also be applied to the bidirectional structure.
Release some reserved connections for the important request: Sometimes the connection
allocation for an important request may fail because no resource is available. In this
case, we may release some low-priority connections so that the important request can be
allocated successfully.
Joint optimization for BE and GS: In this thesis, we mainly focus on GS traffic. In
the future, we will pay more attention to the joint optimization of BE and GS.
Bibliography
[BCGK04] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. Qnoc:
Qos architecture and design process for network on chip. Journal of systems
architecture, 50(2):105–128, 2004.
[BDM02] Luca Benini and Giovanni De Micheli. Networks on chip: a new paradigm for
systems on chip design. In Design, Automation and Test in Europe Conference
and Exhibition, 2002. Proceedings, pages 418–419. IEEE, 2002.
[BHM77] Stephen Bradley, Arnoldo Hax, and Thomas Magnanti. Applied mathematical
programming. 1977.
[BS05] Tobias Bjerregaard and Jens Sparso. A router architecture for connection-
oriented service guarantees in the mango clockless network-on-chip. In Design,
Automation and Test in Europe, pages 1226–1231. IEEE, 2005.
[CMF16a] Yong Chen, Emil Matus, and Gerhard P Fettweis. Centralized parallel multi-
path multi-slot allocation approach for tdm nocs. In Electrical and Computer
Engineering (CCECE’16), 2016 IEEE Canadian Conference on, pages 1–5.
IEEE, 2016.
[CMF16b] Yong Chen, Emil Matus, and Gerhard P Fettweis. Trellis-search based dy-
namic multi-path connection allocation for tdm-nocs. In Great Lakes Sym-
posium on VLSI (GLSVLSI’16), 2016 International, pages 323–328. ACM,
2016.
[CMF17a] Yong Chen, Emil Matus, and Gerhard P Fettweis. Combined centralized and
distributed connection allocation in large tdm circuit switching nocs. In Great
Lakes Symposium on VLSI(GLSVLSI’17), 2017 International. ACM, 2017.
[CMF17b] Yong Chen, Emil Matus, and Gerhard P Fettweis. Combined packet and tdm
circuit switching nocs with novel connection configuration mechanism. In
Circuits and Systems (ISCAS’17), 2017 IEEE International Symposium on.
IEEE, 2017.
[CMF17c] Yong Chen, Emil Matus, and Gerhard P Fettweis. Combined tdm and sdm
circuit switching nocs with dedicated connection allocator. In IEEE Annual
Symposium on VLSI(ISVLSI’17). IEEE, 2017.
[CMF17d] Yong Chen, Emil Matus, and Gerhard P Fettweis. Register-exchange based
connection allocator for circuit switching nocs. In Parallel, Distributed,
and Network-Based Processing (PDP’17), 2017 25th Euromicro International
Conference on. IEEE, 2017.
[CMMF] Yong Chen, Emil Matus, Sadia Moriam, and Gerhard P Fettweis. High per-
formance dynamic resource allocation for guaranteed service in network-on-
chips. IEEE Transactions on Emerging Topics in Computing, submitted.
[CTSM13] Kazem Cheshmi, Jelena Trajkovic, Mohammadreza Soltaniyeh, and Siamak
Mohammadi. Quota setting router architecture for quality of service in gals
noc. In 2013 International Symposium on Rapid System Prototyping (RSP),
pages 44–50. IEEE, 2013.
[DT04] William James Dally and Brian Patrick Towles. Principles and practices of
interconnection networks. Elsevier, 2004.
[EJ13] Ahsen Ejaz and Axel Jantsch. Costs and benefits of flexibility in spatial
division circuit switched networks-on-chip. In Proceedings of the Sixth In-
ternational Workshop on Network on Chip Architectures, pages 41–46. ACM,
2013.
[Fet95] Gerhard Fettweis. Algebraic survivor memory management design for viterbi
detectors. IEEE Transactions on communications, 43(9):2458–2463, 1995.
[GDR05] Kees Goossens, John Dielissen, and Andrei Radulescu. Æthereal network on
chip: concepts, architectures, and implementations. IEEE Design & Test of
Computers, 22(5):414–421, 2005.
[GEEK11] Fayez Gebali, Haytham Elmiligi, and Mohamed Watheq El-Kharashi.
Networks-on-chips: theory and practice. CRC press, 2011.
[GH10] Kees Goossens and Andreas Hansson. The aethereal network on chip after ten
years: Goals, evolution, lessons, and future. In Design Automation Conference
(DAC), 2010 47th ACM/IEEE, pages 306–311. IEEE, 2010.
[GHKM11] Boris Grot, Joel Hestness, Stephen W Keckler, and Onur Mutlu. Kilo-noc: a
heterogeneous network-on-chip architecture for scalability and service guar-
antees. In ACM SIGARCH Computer Architecture News, volume 39, pages
401–412. ACM, 2011.
[HCG07] Andreas Hansson, Martijn Coenen, and Kees Goossens. Channel trees: re-
ducing latency by sharing time slots in time-multiplexed networks on chip.
In Proceedings of the 5th IEEE/ACM international conference on Hard-
ware/software codesign and system synthesis, pages 149–154. ACM, 2007.
[Hei14] Jan Heisswolf. A Scalable and Adaptive Network on Chip for Many-Core
Architectures. PhD thesis, Karlsruhe, Karlsruher Institut für Technologie
(KIT), Diss., 2014, 2014.
[HG07] Andreas Hansson and Kees Goossens. Trade-offs in the configuration of a
network on chip for multiple use-cases. In Networks-on-Chip, 2007. NOCS
2007. First International Symposium on, pages 233–242. IEEE, 2007.
[HG10] Andreas Hansson and Kees Goossens. On-chip interconnect with aelite: com-
posable and predictable systems. Springer Science & Business Media, 2010.
[HSG09] Andreas Hansson, Mahesh Subburaman, and Kees Goossens. aelite: A
flit-synchronous network on chip with composable and predictable services. In
Proceedings of the conference on design, automation and test in Europe, pages
250–255. European Design and Automation Association, 2009.
[JPL08] Natalie D Enright Jerger, Li-Shiuan Peh, and Mikko H Lipasti. Circuit-
switched coherence. In Proceedings of the second ACM/IEEE international
symposium on networks-on-chip, pages 193–202. IEEE Computer Society,
2008.
[Kam07] Matthias Kamuf. Trellis decoding: From algorithm to flexible architectures.
Series of licentiate and doctoral theses, 2007.
[KS14] Evangelia Kasapaki and Jens Sparsø. Argo: A time-elastic time-division-
multiplexed noc using asynchronous routers. In Asynchronous Circuits and
Systems (ASYNC), 2014 20th IEEE International Symposium on, pages 45–
52. IEEE, 2014.
[LJ08] Zhonghai Lu and Axel Jantsch. Tdm virtual-circuit configuration for
network-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
16(8):1021–1034, 2008.
[LJL12] Shaoteng Liu, Axel Jantsch, and Zhonghai Lu. Parallel probing: Dynamic
and constant time setup procedure in circuit switching noc. In Proceedings of
the Conference on Design, Automation and Test in Europe, pages 1289–1294.
EDA Consortium, 2012.
[LJL14a] Shaoteng Liu, Axel Jantsch, and Zhonghai Lu. A fair and maximal allocator
for single-cycle on-chip homogeneous resource allocation. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 22(10):2230–2234, 2014.
[LJL14b] Shaoteng Liu, Axel Jantsch, and Zhonghai Lu. Parallel probe based dynamic
connection setup in tdm nocs. In Proceedings of the conference on Design,
Automation & Test in Europe, page 239. European Design and Automation
Association, 2014.
[LJL15] Shaoteng Liu, Axel Jantsch, and Zhonghai Lu. Multics: Circuit switched noc
with multiple sub-networks and sub-channels. Journal of Systems Architec-
ture, 61(9):423–434, 2015.
[LKFF12] Shu Lin, Tadao Kasami, Toru Fujiwara, and Marc Fossorier. Trellises and
trellis-based decoding algorithms for linear block codes, volume 443. Springer
Science & Business Media, 2012.
[LL11] Angelo Kuti Lusala and Jean-Didier Legat. Combining sdm-based circuit
switching with packet switching in a noc for real-time applications. In
2011 IEEE International Symposium of Circuits and Systems (ISCAS), pages
2505–2508. IEEE, 2011.
[LL12a] Angelo Kuti Lusala and Jean-Didier Legat. Combining sdm-based circuit
switching with packet switching in a router for on-chip networks. International
Journal of Reconfigurable Computing, 2012, 2012.
[LL12b] Angelo Kuti Lusala and Jean-Didier Legat. A sdm-tdm-based circuit-switched
router for on-chip networks. ACM Transactions on Reconfigurable Technology
and Systems (TRETS), 5(3):15, 2012.
[LMV+08] Anthony Leroy, Dragomir Milojevic, Diederik Verkest, Frédéric Robert, and
Francky Catthoor. Concepts and implementation of spatial division multi-
plexing for guaranteed throughput in networks-on-chip. IEEE Transactions
on Computers, 57(9):1182–1195, 2008.
[Lou95] H-L Lou. Implementing the viterbi algorithm. IEEE Signal processing mag-
azine, 12(5):42–52, 1995.
[MBD+05] Théodore Marescaux, B Bricke, P Debacker, Vincent Nollet, and Henk Cor-
poraal. Dynamic time-slot allocation for qos enabled networks on chip. In
3rd Workshop on Embedded Systems for Real-Time Multimedia, 2005., pages
47–52. IEEE, 2005.
[MF16] Sadia Moriam and Gerhard P Fettweis. Fault tolerant deadlock-free adaptive
routing algorithms for hexagonal networks-on-chip. In Digital System Design
(DSD), 2016 Euromicro Conference on, pages 131–137. IEEE, 2016.
[MGK14] Usman Mazhar Mirza, Flavius Gruian, and Krzysztof Kuchcinski. Mapping
streaming applications on multiprocessors with time-division-multiplexed
network-on-chip. Computers & Electrical Engineering, 40(8):276–291, 2014.
[MMB07] Orlando Moreira, Jacob Jan-David Mol, and Marco Bekooij. Online resource
management in a multiprocessor with a network-on-chip. In Proceedings of
the 2007 ACM symposium on Applied computing, pages 1557–1564. ACM,
2007.
[MNTJ04] Mikael Millberg, Erland Nilsson, Rikard Thid, and Axel Jantsch. Guaranteed
bandwidth using looped containers in temporally disjoint networks within the
nostrum network on chip. In Design, Automation and Test in Europe Con-
ference and Exhibition, 2004. Proceedings, volume 2, pages 890–895. IEEE,
2004.
[MSAA09] Mehdi Modarressi, Hamid Sarbazi-Azad, and Mohammad Arjomand. A hy-
brid packet-circuit switched on-chip network based on sdm. In Proceedings of
the Conference on Design, Automation and Test in Europe, pages 566–569.
European Design and Automation Association, 2009.
[MTSA10] Mehdi Modarressi, Arash Tavakkol, and Hamid Sarbazi-Azad. Virtual point-
to-point connections for nocs. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 29(6):855–868, 2010.
[PMM15] Farhad Pakdaman, Abbas Mazloumi, and Mehdi Modarressi. Integrated
circuit-packet switching noc with efficient circuit setup mechanism. The
Journal of Supercomputing, 71(8):2787–2807, 2015.
[PNPR07] Christoph Puttmann, Jorg-Christian Niemann, Mario Porrmann, and Ul-
rich Ruckert. Giganoc-a hierarchical network-on-chip for scalable chip-
multiprocessors. In Digital System Design Architectures, Methods and Tools,
2007. DSD 2007. 10th Euromicro Conference on, pages 495–502. IEEE, 2007.
[RRRM08] Crispin Gomez Requena, Maria Engracia Gomez Requena, Pedro Lopez Ro-
driguez, and Jose Duato Marin. Exploiting wiring resources on interconnec-
tion network: increasing path diversity. In Parallel, Distributed and Network-
Based Processing, 2008. PDP 2008. 16th Euromicro Conference on, pages
20–29. IEEE, 2008.
[SBSK12] Martin Schoeberl, Florian Brandner, Jens Sparsø, and Evangelia Kasapaki.
A statically scheduled time-division-multiplexed network-on-chip for real-time
systems. In Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International
Symposium on, pages 152–160. IEEE, 2012.
[SG11a] Radu Stefan and Kees Goossens. An improved algorithm for slot selection in
the æthereal network-on-chip. In Proceedings of the Fifth International Work-
shop on Interconnection Network Architecture: On-Chip, Multi-Chip, pages
7–10. ACM, 2011.
[SG11b] Radu Stefan and Kees Goossens. A tdm slot allocation flow based on
multipath routing in nocs. Microprocessors and Microsystems, 35(2):130–138,
2011.
[Sha15] Liu Shaoteng. New circuit switching techniques in on-chip networks. PhD
thesis, KTH Royal Institute of Technology, 2015.
[SKS13] Jens Sparsø, Evangelia Kasapaki, and Martin Schoeberl. An area-efficient
network interface for a tdm-based network-on-chip. In Proceedings of the
Conference on Design, Automation and Test in Europe, pages 1044–1047.
EDA Consortium, 2013.
[SLB07] Christian Schuck, Stefan Lamparth, and Jurgen Becker. artnoc-a novel multi-
functional router architecture for organic computing. In 2007 International
Conference on Field Programmable Logic and Applications, pages 371–376.
IEEE, 2007.
[SMG14] Radu Andrei Stefan, Anca Molnos, and Kees Goossens. daelite: A tdm noc
supporting qos, multicast, and fast connection set-up. IEEE Transactions on
Computers, 63(3):583–594, 2014.
[SNG12] Radu Stefan, Ashkan Beyranvand Nejad, and Kees Goossens. Online allo-
cation for contention-free-routing nocs. In Proceedings of the 2012 Intercon-
nection Network Architecture: On-Chip, Multi-Chip Workshop, pages 13–16.
ACM, 2012.
[Ste12] RA Stefan. Resource allocation in time-division-multiplexed networks on chip.
TU Delft, Delft University of Technology, 2012.
[WF08] Markus Winter and Gerhard P Fettweis. A network-on-chip channel allocator
for run-time task scheduling in multi-processor system-on-chips. In Digital
System Design Architectures, Methods and Tools, 2008. DSD’08. 11th EU-
ROMICRO Conference on, pages 133–140. IEEE, 2008.
[WF11] Markus Winter and Gerhard P Fettweis. Guaranteed service virtual channel
allocation in nocs for run-time task scheduling. In 2011 Design, Automation
& Test in Europe, pages 1–6. IEEE, 2011.
[WL03] Daniel Wiklund and Dake Liu. Socbus: Switched network on chip for hard real
time embedded systems. In Parallel and Distributed Processing Symposium,
2003. Proceedings. International, pages 8–pp. IEEE, 2003.
[WPG10] Markus Winter, Steffen Prusseit, and Gerhard P Fettweis. Hierarchical
routing architectures in clustered 2d-mesh networks-on-chip. In SoC Design
Conference (ISOCC), 2010 International, pages 388–391. IEEE, 2010.
[WSRS05] Pascal T Wolkotte, Gerard JM Smit, Gerard K Rauwerda, and Lodewijk T
Smit. An energy-efficient reconfigurable circuit-switched network-on-chip. In
Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE
International, pages 8–pp. IEEE, 2005.
[YKH10] Zhiyao Joseph Yang, Akash Kumar, and Yajun Ha. An area-efficient
dynamically reconfigurable spatial division multiplexing network-on-chip with static
throughput guarantee. In Field-Programmable Technology (FPT), 2010 In-
ternational Conference on, pages 389–392. IEEE, 2010.
[YZSZ14] Jieming Yin, Pingqiang Zhou, Sachin S Sapatnekar, and Antonia Zhai.
Energy-efficient time-division multiplexed hybrid-switched noc for
heterogeneous multicore systems. In Parallel and Distributed Processing Symposium,
2014 IEEE 28th International, pages 293–303. IEEE, 2014.
