CROSS-LAYER DESIGN, OPTIMIZATION
AND PROTOTYPING OF NoCs FOR THE
NEXT GENERATION OF
HOMOGENEOUS MANY-CORE SYSTEMS
by TATENGUEM FANKEM, Hervé
University of Ferrara
Engineering Department of the University of Ferrara
Doctorate Degree in Science of Engineering
Coordinator: Prof. Stefano Trillo
Cycle: XXVI
ING-INF/01
Candidate: Hervé Tatenguem Fankem
Advisor: Prof. Davide Bertozzi
Academic Year 2011/2013
Acknowledgements
First of all, I would like to thank my advisor, Prof. Davide Bertozzi. He has
taught me how to do good research. I appreciate all his contributions of
time, ideas, and funding to make my Ph.D. experience productive and
stimulating. The joy and enthusiasm he has for his research were motivational
for me, even during tough times in the Ph.D. pursuit. The members of the
MPSoCs group have contributed immensely to my personal and professional
skills. The group has been a source of friendships as well as good advice and
collaboration. I am especially grateful to Daniele, Alessandro, Luca, Alberto,
Marco and Gabriele. I would also like to acknowledge Jann Raik and the
Tallinn Institute of Technology, where I spent my internship. For this
dissertation, I would like to thank my reading committee members for their
time, interest, and helpful comments. I gratefully acknowledge the funding
sources that made my Ph.D. work possible. I was funded by the E.U. NaNoC and
vIrtical Projects. My time at UNIFE was made enjoyable in large part due
to the many friends and groups that became a part of my life. I am grateful
for time spent with colleagues and friends, and for our memorable trips into
the cities. Lastly, I would like to thank my family for all their love and
encouragement: my parents, who raised me with a love of science and supported
me in all my pursuits; my brothers and sister, for their presence; and most of
all my supportive, encouraging, patient and loving Reine Flore, who
faithfully supported me during my Ph.D.
Thank you.
Hervé Tatenguem Fankem

Contents
Contents i
List of Figures vi
List of Tables 1
Abstract 3
Introduction 5
1 Overview of Two Architectural Variants of a Mesh 9
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 First Architectural Variant of a Mesh . . . . . . . . . . . . . . 9
2.1 Baseline Architecture . . . . . . . . . . . . . . . . . . . 9
2.2 Basic Design Choices for the New Switch . . . . . . . . 10
Fault-Tolerance . . . . . . . . . . . . . . . . . . . . . . 10
Notiﬁcation Interface . . . . . . . . . . . . . . . . . . . 11
Reconﬁgurability . . . . . . . . . . . . . . . . . . . . . 11
2.3 Experimental Results . . . . . . . . . . . . . . . . . . . 12
Complexity Breakdown: Area Results . . . . . . . . . . 12
Complexity Breakdown: Delay Results . . . . . . . . . 13
3 Second Architectural Variant of a Mesh . . . . . . . . . . . . . 14
3.1 Tightly Coupled Dc FIFO . . . . . . . . . . . . . . . . 14
3.2 Input/Output Buﬀers . . . . . . . . . . . . . . . . . . . 16
3.3 Probing System . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Path Shifting . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Control Path . . . . . . . . . . . . . . . . . . . . . . . 17
4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 17
5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Synthesis Flow 21
1 Design ﬂow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 22
2 IC Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1 Timing Margins . . . . . . . . . . . . . . . . . . . . . . 24
3 Fixing hold-time violations . . . . . . . . . . . . . . . . . . . . 26
4 Requirements for an IC timing optimization tool . . . . . . . . 27
5 The Compress Hold Time Tool . . . . . . . . . . . . . . . . . 28
5.1 Hold-time buﬀer insertion . . . . . . . . . . . . . . . . 28
6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.1 Hold Time Robustness: Fine-tuning the flow on a 2D
mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Contrasting Multi-Synchronous MPSoC Design Styles for
Fine-Grained Clock Domain Partitioning: the Full-HD Video
Playback Case Study 33
1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Synchronizer Architecture . . . . . . . . . . . . . . . . . . . . 38
5 Full-HD Video Playback Requirements . . . . . . . . . . . . . 41
5.1 System conﬁguration . . . . . . . . . . . . . . . . . . . 41
6 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 46
7.1 Area results . . . . . . . . . . . . . . . . . . . . . . . . 46
7.2 Setting the speed of the mesochronous NoC . . . . . . 47
7.3 Power results . . . . . . . . . . . . . . . . . . . . . . . 50
8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Mesochronous NoC Technology for Power-Eﬃcient GALS
MPSoCs: Mesochronous vs. Synchronous 55
1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3 Target GALS Architecture . . . . . . . . . . . . . . . . . . . . 59
4 Hybrid coupling of synchronizer with the NoC . . . . . . . . . 61
5 Synthesis of GALS Platforms . . . . . . . . . . . . . . . . . . 62
6 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 64
Area and Wiring Overhead . . . . . . . . . . . . . . . . 64
Power analysis . . . . . . . . . . . . . . . . . . . . . . 65
7 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . 66
5 Testing Architecture on Top of the First Variant of the
Mesh 69
1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3 Baseline Architecture . . . . . . . . . . . . . . . . . . . . . . . 72
4 Basic Design Choices for the New Switch . . . . . . . . . . . . 72
Fault-Tolerance . . . . . . . . . . . . . . . . . . . . . . 73
Notiﬁcation Interface . . . . . . . . . . . . . . . . . . . 73
Reconﬁgurability . . . . . . . . . . . . . . . . . . . . . 73
Boot-Time Testing Architecture . . . . . . . . . . . . . 74
5 Cross-Feature Optimizations . . . . . . . . . . . . . . . . . . . 75
6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 78
Complexity Breakdown: Area Results . . . . . . . . . . 78
Complexity Breakdown: Delay Results . . . . . . . . . 80
Coverage for single stuck-at faults . . . . . . . . . . . . 80
7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 Testing Architecture on Top of the Second Variant of the
Mesh 83
1 Switch Architecture . . . . . . . . . . . . . . . . . . . . . . . . 83
1.1 Tightly Coupled Dc FIFO . . . . . . . . . . . . . . . . 84
1.2 Input/Output Buﬀers . . . . . . . . . . . . . . . . . . . 84
1.3 Probing System . . . . . . . . . . . . . . . . . . . . . . 85
1.4 Error correction . . . . . . . . . . . . . . . . . . . . . . 86
1.5 Path Shifting . . . . . . . . . . . . . . . . . . . . . . . 87
1.6 Control Path . . . . . . . . . . . . . . . . . . . . . . . 88
2 Testing Methodology . . . . . . . . . . . . . . . . . . . . . . . 88
3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 90
7 Ultra-Low Latency NoC testing via Pseudo-Random Test
Pattern Compaction 93
1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4 Testing methodology . . . . . . . . . . . . . . . . . . . . . . . 97
5 Baseline Switch Architecture . . . . . . . . . . . . . . . . . . . 99
6 Testing Architecture . . . . . . . . . . . . . . . . . . . . . . . 99
6.1 LBDR testing . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Arbiter testing . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Output buﬀer testing . . . . . . . . . . . . . . . . . . . 102
6.4 Input buﬀer testing . . . . . . . . . . . . . . . . . . . . 103
6.5 Testing Multiplexers of the Crossbar . . . . . . . . . . 104
6.6 Testing Infrastructure Optimization . . . . . . . . . . . 104
7 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 106
7.1 Area Overhead . . . . . . . . . . . . . . . . . . . . . . 106
7.2 Testing time . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8 Cost-eﬀective Contention Avoidance in a CMP with Shared
Memory Controllers 113
1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3 NoC Background and Related work . . . . . . . . . . . . . . . 116
4 NoC Congestion . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.1 NoC Contention . . . . . . . . . . . . . . . . . . . . . . 117
4.2 Application Performance Relative to Memory Controller
Location . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5 NoC Congestion Control . . . . . . . . . . . . . . . . . . . . . 119
6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.1 System conﬁguration . . . . . . . . . . . . . . . . . . . 121
6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3 Hardware breakdown . . . . . . . . . . . . . . . . . . . 125
7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9 Final FPGA Prototyping of Homogeneous Multicores 129
1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3 FPGA Platform . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4 The System Under Test . . . . . . . . . . . . . . . . . . . . . 133
5 Basic components: the on-chip network . . . . . . . . . . . . . 136
5.1 The Network Interfaces . . . . . . . . . . . . . . . . . . 136
6 Basic components: the supervision subsystem . . . . . . . . . 137
7 Basic components: the reconﬁguration algorithm . . . . . . . . 139
8 The application . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9 The physical platform implementation . . . . . . . . . . . . . 141
10 Validating Built-in Self-Testing and NoC conﬁguration . . . . 141
11 Validating Fault Detection and NoC Reconﬁguration . . . . . 142
12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
10 Power Characterization of Optical NoC Interfaces 151
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
2 Optical Network Interface Architecture . . . . . . . . . . . . . 151
3 Power Characterization . . . . . . . . . . . . . . . . . . . . . . 153
4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . 154
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Bibliography 161
List of Figures
1.1 First variant: NACK/GO switch architecture. . . . . . . . . . 12
1.2 Area analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Routing delay analysis. . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Dual-Clock FIFO integration into one input port of the switch
architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 The Second architectural variant at a glance. . . . . . . . . . 18
1.6 Area overhead @500MHz. . . . . . . . . . . . . . . . . . . . . 19
1.7 Testing overhead and area Breakdown. . . . . . . . . . . . . . 19
1.8 Normalized routing delay @Max Performance. . . . . . . . . . 20
2.1 Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty. . . . . . . . 23
2.2 Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty. . . . . . . . 25
2.3 Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty. . . . . . . . 26
2.4 Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty. . . . . . . . 26
3.1 Plain multi-synchronous architecture based on dual-clock FIFOs. 37
3.2 Mesochronous synchronization in a multi-synchronous archi-
tecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Architecture of synchronizers. . . . . . . . . . . . . . . . . . . 42
3.4 Communication bandwidth requirements for full-HD video play-
back: 1920x1080 pixel, 60 frames/s, true color. . . . . . . . . . 43
3.5 Area comparison between multi-synchronous NoC implemen-
tation variants when varying the FIFO depth. . . . . . . . . . 47
3.6 Determining the speed of the mesochronous NoC for the Arch.1
and Arch.2 settings. . . . . . . . . . . . . . . . . . . . . . . . 48
3.7 Power consumption of Arch.1 system conﬁgurations. . . . . . 50
3.8 Power consumption of Arch.2 system conﬁgurations. . . . . . 51
3.9 Power consumption of Arch.3 system conﬁgurations. . . . . . 51
4.1 Plain multi-synchronous architecture based on dual-clock FIFOs. 58
4.2 Mesochronous synchronization in a multi-synchronous archi-
tecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Hybrid mesochronous synchronizer architecture. . . . . . . . . 59
4.4 Power consumption with no activity and with uniform random
traﬃc (normalized with respect to the mesochronous network). 63
4.5 Area and wiring intricacy (normalized with respect to the
mesochronous network). . . . . . . . . . . . . . . . . . . . . . 64
5.1 NACK/GO switch architecture. . . . . . . . . . . . . . . . . . 78
5.2 BIST-enhanced switch architecture. . . . . . . . . . . . . . . . 79
5.3 Interdependency Diagram between Reconﬁguration, Fault-tolerance,
Testing and Notiﬁcation. . . . . . . . . . . . . . . . . . . . . . 80
5.4 Area analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Routing delay analysis. . . . . . . . . . . . . . . . . . . . . . . 82
6.1 Dual-Clock FIFO integration into one input port of the switch
architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 The Second architectural Variant of the Mesh at a glance. . . 88
6.3 Testing Architecture. In green, the test wrapper is pointed out. 89
6.4 Area overhead @500MHz. . . . . . . . . . . . . . . . . . . . . 90
6.5 Testing overhead and area Breakdown. . . . . . . . . . . . . . 91
6.6 Normalized routing delay @Max Performance. . . . . . . . . . 91
7.1 Baseline Switch Architecture. . . . . . . . . . . . . . . . . . . 100
7.2 LBDR Testing Architecture. . . . . . . . . . . . . . . . . . . . . 101
7.3 Arbiter Testing Architecture. . . . . . . . . . . . . . . . . . . 102
7.4 Output Buﬀer Testing Architecture. . . . . . . . . . . . . . . . 103
7.5 Testing Architecture for Crossbar Multiplexers. . . . . . . . . 104
7.6 Cascaded Testing Architecture. . . . . . . . . . . . . . . . . . 105
7.7 An Electron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.1 CMP tile-based design with dynamic application domains . . . 114
8.2 Basic switch architecture . . . . . . . . . . . . . . . . . . . . . 115
8.3 Switch architecture . . . . . . . . . . . . . . . . . . . . . . . . 121
8.4 32 concurrent applications mapped on the system . . . . . . . 123
8.5 32 concurrent applications mapped on the system (MSE be-
tween injected and accepted traﬃc) . . . . . . . . . . . . . . . 125
8.6 32 concurrent applications mapped on the system (Averaged
network throughput, ocean workload) . . . . . . . . . . . . . . 126
8.7 Execution time distribution, 1 memory controller, ocean work-
load (With virtual channels) . . . . . . . . . . . . . . . . . . . 127
8.8 Execution time distribution, 1 memory controller, ocean work-
load (No virtual channels) . . . . . . . . . . . . . . . . . . . . 128
8.9 Switch area at 500 MHz . . . . . . . . . . . . . . . . . . . . . 128
9.1 VC707 baseline prototyping board. . . . . . . . . . . . . . . . 131
9.2 FPGA platform overview . . . . . . . . . . . . . . . . . . . . . 133
9.3 Basic components of the on-chip network. . . . . . . . . . . . 145
9.4 Design ﬂow for platform implementation. . . . . . . . . . . . . 145
9.5 Built-In-Self-Testing at work (a). . . . . . . . . . . . . . . . . 146
9.6 Built-In-Self-Testing at work (b). . . . . . . . . . . . . . . . . 146
9.7 Built-In-Self-Testing at work (c). . . . . . . . . . . . . . . . . 147
9.8 Built-In-Self-Testing at work (d). . . . . . . . . . . . . . . . . 147
9.9 Transient fault detection and reconﬁguration (a). . . . . . . . 148
9.10 Transient fault detection and reconﬁguration (b). . . . . . . . 148
9.11 Transient fault detection and reconﬁguration (c). . . . . . . . 149
9.12 Transient fault detection and reconﬁguration (d). . . . . . . . 149
10.1 Optical Network Interface Architecture. . . . . . . . . . . . . . 153
10.2 Static Power of Electronic Network Interface vs. Optical Net-
work Interface@//3. . . . . . . . . . . . . . . . . . . . . . . . . 156
10.3 Static Power of Electronic Network Interface vs. Optical Net-
work Interface@//4. . . . . . . . . . . . . . . . . . . . . . . . . 156
10.4 Total Static Power of Electronic Network vs. Optical Network
@//3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.5 Total Static Power of Electronic Network vs. Optical Network
@//4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
List of Tables
2.1 Worst slack on links: unoptimized 4x4 mesh vs. HTR1 vs. HTR2. 31
2.2 Total slack on links: unoptimized 4x4 mesh vs. HTR1 vs. HTR2. 32
2.3 Total slack on whole design: unoptimized 4x4 mesh vs. HTR1
vs. HTR2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Total Area: 4x4 mesh vs. HTR1 vs. HTR2. . . . . . . . . . . . 32
3.1 Range for IP core speed and chosen settings. . . . . . . . . . . 45
5.1 Coverage for single stuck-at faults. . . . . . . . . . . . . . . . 81
6.1 Coverage for single stuck-at faults. . . . . . . . . . . . . . . . 92
7.1 Test Application Time Per Block . . . . . . . . . . . . . . . . 108
7.2 Testing Cycles as function of the Testing Approach . . . . . . 108
7.3 Test application time and coverage of diﬀerent testing methods 109
7.4 Coverage as function of the Testing Approach . . . . . . . . . 110
7.5 Compaction Table . . . . . . . . . . . . . . . . . . . . . . . . 110
8.1 CMP conﬁguration. . . . . . . . . . . . . . . . . . . . . . . . . 122
9.1 Resource utilization of the Virtex 7 chip. . . . . . . . . . . . . 144
10.1 Static and Dynamic Power Of Electronic Devices. . . . . . . . 154
Abstract
This thesis provides a set of design methods to enable and manage the
runtime heterogeneity of feature-rich, industry-ready tile-based
Networks-on-Chip (NoCs) at different abstraction layers (Architecture Design,
Network Assembling, Testing of NoC, Runtime Operation). The key idea is to
maintain the functionalities of the original layers, and to improve the
performance of architectures by allowing joint optimization and coordination
among layers. In general-purpose systems, we address the microarchitectural
challenges by co-designing and co-optimizing feature-rich architectures. In
application-specific NoCs, we emphasize event notification, so that the
platform is continuously under control. At the network assembly level, this
thesis proposes a Hold Time Robustness technique to tackle the hold-time
issue in synchronous NoCs. At the network architectural level, the choice of
a suitable synchronization paradigm requires an enhanced synthesis flow as
well as coexistence with DVFS. On one hand, this implies the coexistence of
mesochronous synchronizers in the network with dual-clock FIFOs at network
boundaries. On the other hand, dual-clock FIFOs may be placed across
inter-switch links, hence removing the need for mesochronous synchronizers.
This thesis studies the implications of the above approaches both on the
design flow and on the performance and power quality metrics of the network.
Once the many-core system is assembled, the issue of testing it arises. This
thesis takes on this challenge and engineers various testing infrastructures.
At the upper abstraction layer, the thesis addresses the issue of managing
the fully operational system and proposes a congestion management technique
named HACS. Moreover, some of the ideas of this thesis are validated through
FPGA prototyping. Finally, we look ahead to emerging technology by
characterizing the power consumption of Optical NoC Interfaces.

Introduction
Today, many-core embedded systems are moving towards the integration of
thousands of cores on a single chip. However, as the number of cores
integrated into a chip increases, on-chip communication tends to become
both the performance bottleneck and a major source of power consumption.
To cope with the increasing demand for high-performance systems, many-core
designs rely on integrated networks-on-chip and exploit parallelism to make
programs run faster. Indeed, among all the solutions that have been proposed
for the on-chip communication infrastructure, Networks-on-Chip (NoCs) are the
most viable way to meet the performance and design productivity requirements
of a complex on-chip communication infrastructure. On one hand, NoCs provide
better modularity, scalability, fault tolerance and higher bandwidth than
traditional infrastructures. On the other hand, developing applications that
use the full power of NoC-based many-core embedded systems is not trivial and
requires parallel programs. Moreover, the programs to be run on the many-core
chip are parallelized to varying degrees, and they may have different
characteristics regarding processing needs, memory space and bandwidth. Above
all, the efficient exploitation of the abundant hardware resources will
increasingly rely on sharing such resources among a large number of
concurrently executing applications.
The focus is therefore on how to manage the resources in many-core
chips in response to an increasingly complex and resource-sharing
workload, and how to optimize cooperation between system design
layers. There are two main examples of such resource management concerns.
On one hand, each core or cluster of cores will be operated at a different
voltage and frequency for the sake of optimal execution and ultimately of
power management. On the other hand, such splitting will indirectly relieve
the clock distribution problem in large chips, which can no longer be solved
under the fully synchronous assumption. The above examples confirm that
design layers are not isolated in manycore design, but have deep
cross-layer implications that should be co-optimized.
Following the same cross-layer vision, future applications will not be able
to assume that the underlying manycore computation and communication
fabrics are working in their entirety. In fact, manufacturing faults will play
an increasing role in system integrity, which calls for the relentless
development of testing strategies. To ensure the required quality and
reliability of such complex integrated circuits before supplying them to final
users, extensive manufacturing tests need to be conducted and (an absolute
novelty) the associated test cost may soon account for the same share of the
total production cost, as the ITRS documents start to point out.
When we combine the above (apparently different) issues, it becomes
evident that their compound effect is to turn a fully regular and
homogeneous manycore platform into a runtime-heterogeneous one. In fact, a
design that is homogeneous by construction undergoes a differentiation of
operating conditions, and suffers from the regularity-breaking effect of
manufacturing faults. This consideration is at the foundation of this work.
To address all these challenges, this thesis provides a set of
design methods to enable and manage the runtime heterogeneity of
feature-rich, industry-ready tile-based Networks-on-Chip at
different abstraction layers (1st layer: Architecture Design, 2nd layer:
Network Assembling, 3rd layer: Testing of NoC, 4th layer: Runtime
Operation).
The key idea is to maintain the functionalities associated with the original
layers, and to improve the performance of architectures by allowing
interaction, joint optimization and coordination among the layers (cross-layer
design).
At the architecture layer, the manycore management challenge fundamentally
means the extension of current NoC architectures towards increased
flexibility, reconfigurability and/or notification capability. These terms
assume a different meaning depending on the NoC application domain. In
general-purpose systems, the microarchitectural challenges consist of
augmenting regular tile-based NoCs into systems capable of runtime
reconfiguration of the routing function, of transient fault notification, and
of some form of fault-tolerance capability. This motivates the effort
presented in this thesis: bringing state-of-the-art NoC architectures into the
next generation of industry-ready NoCs. This essentially goes through the
design of feature-rich architectures where the different features are
co-designed and co-optimized. Conversely, in application-specific NoCs the
reconfigurability requirement is (and will be) still far away, while instead
the emphasis will be on event notification, so that the platform is
continuously under control.
At the network assembly level, the large size of manycore systems and the un-
predictability of the underlying silicon technology raise unprecedented com-
positional challenges, which are fundamentally physical design and design
technology issues.
For example, clock variability causes timing issues, since clock skew affects
both the setup-time and the hold-time margins of a path.
While setup-time issues force the system to run at reduced performance,
hold-time violations may render a system non-functional.
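The asymmetry between the two kinds of violation can be illustrated with the textbook slack equations for a register-to-register path. The following is a minimal sketch, not the timing model used in this thesis: the class name, the function names and all delay values are hypothetical, and skew is taken as capture-clock arrival minus launch-clock arrival.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Delays of one register-to-register path, in ns (hypothetical values)."""
    clk_to_q: float  # clock-to-output delay of the launching flip-flop
    comb_max: float  # slowest combinational/wire delay on the path
    comb_min: float  # fastest combinational/wire delay on the path
    setup: float     # setup time of the capturing flip-flop
    hold: float      # hold time of the capturing flip-flop


def setup_slack(p: Path, period: float, skew: float) -> float:
    # Data must arrive one clock period later, before the setup window opens.
    return period + skew - (p.clk_to_q + p.comb_max + p.setup)


def hold_slack(p: Path, skew: float) -> float:
    # Data must not change until the hold window of the same edge closes.
    # The clock period does not appear in this expression.
    return (p.clk_to_q + p.comb_min) - (p.hold + skew)


# Hypothetical inter-switch link.
link = Path(clk_to_q=0.10, comb_max=1.60, comb_min=0.05, setup=0.08, hold=0.05)

# A setup violation caused by negative skew disappears at a longer period.
print(setup_slack(link, period=2.0, skew=-0.3) < 0)  # violated
print(setup_slack(link, period=2.2, skew=-0.3) > 0)  # fixed by slowing down

# A hold violation caused by positive skew is independent of the period.
print(hold_slack(link, skew=0.3) < 0)
```

Since the clock period is absent from the hold equation, slowing the clock cannot repair a hold violation, which is why such violations are typically fixed by adding delay to the fast path.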
The above problems statistically show up at switch boundaries, since switches
are placed far apart in real layouts. Therefore, it is at the switch boundaries
(that is, in inter-switch communication) that the above issues should mainly be
addressed. This thesis proposes an HTR (Hold Time Robustness) technique
that tackles the hold-time issue in synchronous NoC links. At the
network architectural level, manycore design poses the challenge of choosing
a suitable synchronization paradigm. On one hand, such large systems may
break the fully synchronous assumption by means of mesochronous clocking.
Today, this paradigm requires an enhanced synthesis flow as well as
coexistence with the DVFS (Dynamic Voltage and Frequency Scaling) requirement.
Ultimately, this implies the coexistence of mesochronous synchronizers in the
network with dual-clock FIFOs at network boundaries. On the other hand,
the penetration of dual-clock FIFOs in the design may be much deeper, that
is, they may be placed across inter-switch links, hence removing the need for
mesochronous synchronizers. This approach has fundamentally different
implications both on the design flow and on the performance and power quality
metrics of the network. These are all studied in this thesis.
Once the manycore system is assembled, the issue of testing it arises.
This thesis takes on this challenge and engineers a testing infrastructure
on top of a general-purpose NoC and of an application-specific NoC,
respectively. Moreover, an ultra-low-latency NoC testing infrastructure is
designed for "online testing", which cannot afford a large number of testing
cycles.
At the upper abstraction layer, the thesis addresses the issue of managing the
fully operational system delivered to the end user. At this level, among all the
possible runtime management issues, we focus on the congestion management
problem in two different scenarios. In the first scenario, we have multiple
physical networks (one global network and one local network composed of
two virtual channels). In the second scenario, a single physical network with
3 VCs or 2 VCs is considered in an attempt to collapse the networks into one.
In this scenario, local and global traffic interfere not only on links but
also in switches. To overcome this issue, this thesis proposes a congestion
management technique named HACS (Head-of-line Avoidance Congestion Skip-ahead),
a head-of-line blocking observation mechanism that allows buffered packets to
bypass the packet that is at the head of the queue. Moreover, some of the
ideas of this thesis are validated through FPGA prototyping. In fact, to
validate the industry-ready NoC, this thesis reports on the prototyping of
a 16-core homogeneous multi-core processor with a fault-tolerant,
runtime reconfigurable and dynamically virtualizable on-chip network.
The prototyped system validates the NoC capability of boot-time
testing and configuration, transient or intermittent fault detection and
runtime reconfiguration of the routing function. Finally, we provide some
features for emerging technology. My contribution was a key enabler for the
power characterization of Optical NoC Interfaces, so as to look ahead to
emerging optical interconnect technology. Overall, the thesis
is a comprehensive contribution to the advancement of the field of manycore
NoC-based system design.
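The head-of-line bypass idea mentioned above for HACS can be sketched in software as follows. This is only an illustration of the general principle under simplifying assumptions, not the HACS hardware described later in the thesis: the function name, the packet representation as `(packet_id, output_port)` pairs and the `port_free` map are all hypothetical.

```python
from collections import deque


def dequeue_with_bypass(queue, port_free):
    """Remove and return the oldest packet whose output port is free.

    queue:     deque of (packet_id, output_port) pairs, oldest first.
    port_free: dict mapping an output port to True when it can accept a flit.
    Returns None when every buffered packet is blocked.
    """
    for index, (packet_id, port) in enumerate(queue):
        if port_free.get(port, False):
            del queue[index]  # later packets may overtake a blocked head
            return (packet_id, port)
    return None


# The head packet "a" wants the congested East port; "b" can skip ahead.
buffer = deque([("a", "E"), ("b", "N"), ("c", "E")])
print(dequeue_with_bypass(buffer, {"E": False, "N": True}))
print(list(buffer))
```

In this sketch, packets heading to the same output port keep their relative order, because the scan always picks the oldest eligible packet.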
Chapter 1
Overview of Two Architectural
Variants of a Mesh
1 Introduction
Digital design convergence, together with the new usage models of mobile
devices, is creating a clear need for new features such as flexible
partitioning, runtime adaptivity and reliability. This trend has direct
implications for the design of the underlying on-chip network, which becomes
not only the system integration framework, but also the control framework
that executes hypervisor commands or reacts to runtime operating conditions.
The ultimate challenge for the NoC is to co-design these features together.
This chapter takes on this challenge and illustrates two design experiences of
a NoC switch architecture.
2 First Architectural Variant of a Mesh
2.1 Baseline Architecture
The first architectural variant (see Figure 5.1) proposed in this chapter is a
major extension of the baseline ×pipesLite switch [82], which targets the
embedded computing domain with a very lightweight architecture.
The considered ×pipesLite variant implements logic-based distributed routing
(LBDR): each switch has simple combinational logic that computes target
output ports from packet destinations and local switch coordinates. By means
of 26 configuration bits for each switch (indicating switch port connectivity,
routing restrictions, and deroutes), the routing function can be reconfigured
at runtime [84]. The straightforward yet overly expensive way to make the
baseline switch fault-tolerant is Triple Modular Redundancy (TMR). Its
only advantage is that the TMR architecture can keep the native
STALL/GO flow control unmodified.
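To make the LBDR principle concrete, the sketch below follows the commonly published basic LBDR formulation with 8 routing bits and 4 connectivity bits per switch. It is an assumption-laden illustration, not the 26-bit variant used here: deroutes are omitted, the names `lbdr_route`, `XY_BITS` and `ALL_PORTS` are hypothetical, and coordinates are assumed to grow eastwards in x and southwards in y.

```python
def lbdr_route(cur, dst, r, c):
    """Return the set of admissible output ports at switch `cur`.

    cur, dst: (x, y) coordinates of the current switch and the destination.
    r: routing bits; r["NE"] is True when a packet may leave through North
       and then turn East at the next switch (likewise for the other keys).
    c: connectivity bits; c["N"] is True when the North port is connected.
    """
    (xc, yc), (xd, yd) = cur, dst
    # Stage 1: comparators locate the destination relative to this switch.
    n1, s1 = yd < yc, yd > yc
    e1, w1 = xd > xc, xd < xc
    if not (n1 or s1 or e1 or w1):
        return {"L"}  # destination reached: deliver to the local port
    # Stage 2: a port is a candidate if the destination lies straight ahead,
    # or if the turn required at the next switch is allowed.
    candidates = {
        "N": n1 and ((not e1 and not w1) or (e1 and r["NE"]) or (w1 and r["NW"])),
        "E": e1 and ((not n1 and not s1) or (n1 and r["EN"]) or (s1 and r["ES"])),
        "W": w1 and ((not n1 and not s1) or (n1 and r["WN"]) or (s1 and r["WS"])),
        "S": s1 and ((not e1 and not w1) or (e1 and r["SE"]) or (w1 and r["SW"])),
    }
    # Connectivity bits mask ports with no (working) neighbour.
    return {port for port, ok in candidates.items() if ok and c[port]}


# Routing bits encoding XY routing: turns from the Y to the X dimension
# are forbidden, turns from X to Y are allowed.
XY_BITS = {"NE": False, "NW": False, "SE": False, "SW": False,
           "EN": True, "ES": True, "WN": True, "WS": True}
ALL_PORTS = {"N": True, "E": True, "W": True, "S": True}

print(lbdr_route((1, 1), (2, 0), XY_BITS, ALL_PORTS))  # north-east destination
print(lbdr_route((1, 1), (1, 0), XY_BITS, ALL_PORTS))  # destination due north
```

Reconfiguring the routing function then amounts to rewriting the bit dictionaries of each switch; in the sketch, clearing a connectivity bit removes the corresponding port from the result, which is the situation deroute bits are meant to handle in richer variants.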
2.2 Basic Design Choices for the New Switch
The proposed switch architecture is designed to be the basic building block
of a reconﬁgurable and fault-tolerant NoC. Reconﬁguration is achieved by
means of a global controller implemented in software, which requires com-
mand execution support in hardware. A dual network is therefore designed
to exchange control information between switches and the global controller.
Reconﬁgurability is implemented as runtime modiﬁcation of the routing func-
tion, in order to provide not only ﬂexible network partitioning when several
applications are concurrently executed, but also to avoid faulty links/regions
of the network. This latter functionality requires that points of failure are
first detected, both at boot time and at run time, and then notified to the
system manager, which triggers the reconfiguration accordingly.
Fault-Tolerance
Whether a fault-tolerance switching strategy should aﬀect the ﬂow control
protocol or not is a major design choice with high impact on the overall switch
architecture. The work in [87] derives error recovery strategies for the same
NoC switch both from a flow control protocol with error notification
capability (NACK/GO) and from one lacking this support (STALL/GO).
Data retransmission is used in the former case, while the latter can only
rely on error correction. NACK/GO was shown to potentially result in a
shorter critical path, a smaller area and a lower peak power, at
the cost of a slight average power overhead. This led us to opt for NACK/GO
for the proposed switch architecture (Figure 5.1). The proposed solution
targets single event upsets (SEUs). In the data path, detectors trigger flit
retransmissions from the sender buffer; before retransmission, the stored
flit is corrected in case it was corrupted in the buffer. On the control
path, FSMs are triplicated to avoid their permanent misalignment, while
routing and arbitration logic is only duplicated, since dual-rail checkers
(DRCs) can trigger retransmissions from the input buffer upon mismatch
detection.
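The duplicate-and-compare protection of the routing logic can be sketched as
follows. This is a minimal behavioural model under stated assumptions: the
toy routing function, the fault injection argument and the use of a plain
equality check in place of a real dual-rail checker are all hypothetical
simplifications.

```python
# Duplication-with-comparison sketch: two copies of a routing function are
# compared and a mismatch triggers a retransmission (NACK) from the buffer.
# The routing function and the fault model are hypothetical.

def xy_port(cur, dst):
    """Toy routing logic used as the protected function (XY order)."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "S" if dy > cy else "N"
    return "LOCAL"


def route_with_checker(cur, dst, upset_attempts=(), max_attempts=4):
    """Return (port, nacks): chosen port and number of retransmissions.

    upset_attempts lists the attempts during which a transient upset
    corrupts one of the two copies.
    """
    nacks = 0
    for attempt in range(max_attempts):
        copy_a = xy_port(cur, dst)
        copy_b = xy_port(cur, dst)
        if attempt in upset_attempts:
            copy_b = "CORRUPTED"      # single event upset on one copy
        if copy_a == copy_b:          # checker: copies agree -> GO
            return copy_a, nacks
        nacks += 1                    # checker: mismatch -> NACK, retry
    raise RuntimeError("persistent mismatch: fault is not transient")
```

Because a mismatch only tells that one of the two copies is wrong, not which
one, the recovery action is a retry rather than a correction; this is why
duplication suffices here while the FSMs, which hold state, are triplicated.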
Notiﬁcation Interface
In this chapter, we opt for a centralized approach to network control: a global
manager is in charge of network reconﬁguration decisions as an eﬀect of fault-
tolerance, power management or virtualization strategies. In order to address
the need for control signaling between network nodes and the global manager,
we resort to the dual communication infrastructure proposed in [89], where
the main NoC is extended with a ring which connects all the switches of the
main NoC together. The ring implementation implies the extension of each
switch with a simple routing primitive, which is an oversimpliﬁed version of
an input buﬀered switch.
Reconﬁgurability
The reconﬁguration mechanism of the routing function in the presence of
background traﬃc should provide deadlock freedom during the transition
from one routing algorithm to another, when extra dependencies may arise
and lead to deadlock. To cope with this issue, the ﬁrst switch variant lever-
ages Overlapped Static Reconﬁgurations (OSR), a technique which avoids
draining network traffic [88]. OSR was first proposed for off-chip networks,
and its customization for a much more resource-constrained on-chip setting,
named OSR-Lite, was carried out in [83]. The basic principle is the
following: if packets routed with the old routing function are guaranteed
never to go behind packets using the new routing function, then no deadlock
cycles can occur. In OSR this is achieved by means of a token that separates
old packets from new ones. The token advances through the network hop by hop,
following the channel dependency graph of the old routing function, and
progressively drains the network of old packets, allowing new packets to
enter the network at routers that the token has already passed.
Figure 1.1: First variant: NACK/GO switch architecture.
2.3 Experimental Results
All the logic synthesis runs performed in this work have been carried out by
means of a low-power standard-Vth 40nm Inﬁneon technology library.
Complexity Breakdown: Area Results
The following experiment points out both the complexity gap between the
native xpipesLite switch and the feature-rich extended one, and the area
increment that each integrated switch feature contributes. Normalized area
results are shown in figure 5.4, where features are incrementally added to the
baseline switch. The baseline switch and its TMR extension are reported as
reference design points. Fault-tolerance is clearly the highest-impact
feature. A non-negligible area contribution comes from the detector and
corrector modules, which account for almost 13% of the total area. When the
reconfiguration mechanism is integrated into the NACK/GO switch, an 11% area
overhead is introduced. The notification system (TMR-protected dual network)
turns out to be lightweight (5% area overhead), since it takes advantage of
the diagnosis logic already made available for fault-tolerance purposes.
Finally, the switch capable of built-in self-test and self-diagnosis brings a
27% area overhead
(Chapter 5 describes the testing architecture on top of the first variant),
which is the second major source of complexity after fault-tolerance. When
we consider a baseline TMR switch which implements only fault-tolerance
on top of a baseline xpipesLite switch, we can see that the proposed switch
(rightmost bar in the plot) provides many more features at a comparable area
footprint.
Complexity Breakdown: Delay Results
In order to evaluate the effect of each additional feature on the switch
propagation delay, we synthesized a 5x5 switch for maximum performance
for each of the 5 incremental solutions under test. Results are reported in
Fig. 5.5. The fault-tolerant NACK/GO switch, the switch with the OSR-Lite
mechanism and the switch with the notification system achieve a similar
maximum operating speed. Finally, the testing framework degrades the maximum
performance of the NACK/GO switch by 13%, because the test wrappers are
placed on the critical path. The TMR solution is around 30% slower than the
baseline switch, while the proposed switch delivers far more functionality
at the cost of a longer critical path (+13%).
Figure 1.2: Area analysis.
Figure 1.3: Routing delay analysis.
3 Second Architectural Variant of a Mesh
The second variant is a parameterized n × m (n: number of input ports,
m: number of output ports) 2-stage switch with source-based routing,
augmented with fault-tolerance provisions. The switch architecture is
depicted in figure 6.2 and is composed of the following main blocks:
- A fault-tolerant input buffer of two slots with triplicated control logic and
endowed with voters.
- A fault-tolerant output buffer of six slots with the same characteristics as
the input buffer.
- A fault-tolerant arbiter, triplicated and endowed with voters.
- A Path-Shift module and a Crossbar.
- Comparators placed at specific points for runtime diagnosis and for
notifying the global manager.
The following sections describe the behaviour of each block of the switch.
3.1 Tightly Coupled Dc FIFO
Synchronization interfaces, such as dual-clock FIFOs, are typically instanti-
ated as external blocks with respect to the module they are connected with.
This "loose coupling" of synchronizers with respect to NoC components
implies several drawbacks. First, the FIFO module introduces additional
latency in the communication link. As a result, provisions
Figure 1.4: Dual-Clock FIFO integration into one input port of the switch
architecture.
must normally be made, since the flow control signal may arrive multiple
clock cycles after the destination module decides to halt the source module.
The problem can be addressed by reserving space in the destination buﬀer,
thus incurring a signiﬁcant area and power overhead, or by enhancing the
dual-clock FIFO with ﬂow control capability.
In the EU-funded GALAXY project, the aforementioned problem was tackled
by merging the dual-clock FIFO with the switch input buﬀer, thus coming
up with a unique architecture block in charge of buﬀering, synchronization
and ﬂow control, and sharing buﬀering resources for all of these tasks. The
GALAXY project has also shown that this design principle, which we denote
as "tight coupling" of the synchronizer with the NoC, can be applied to dual-clock
FIFOs in a straightforward way. For this reason, the second variant switch
can optionally replace its input buﬀer for a fully synchronous environment
with a dual-clock FIFO for a multi-synchronous environment, as illustrated
in Figure 6.1. In all cases, functional correctness is guaranteed.
A similar functionality can be easily implemented also in the ﬁrst variant
switch.
3.2 Input/Output Buﬀers
Input and output buffers are much simpler than those of the first switch
variant, since they do not have to handle the NACK/GO flow control protocol
but rather the simpler STALL/GO one. The input buffer is sized with two slots,
which is the minimum amount of resources needed not to lose data during
stall activation; three slots were needed with NACK/GO. The output buffer can
be arbitrarily
sized for performance buﬀering. As previously mentioned, control logic of
input and output buﬀers is triplicated for fault tolerance and endowed with
voters. An additional voter is placed on top of the data-path registers with
the purpose of voting the outputs from the three instances of the buﬀer
control logic. The voted output drives the read and write pointers of FIFO
data registers.
3.3 Probing System
At the same time, probes inserted in front of each voter sniﬀ their inputs
and inform (through a comparator and an OR gate) the global manager
about possible malfunctioning of each of the replicated branches. We ﬁnd it
important that the manager can keep this kind of information under control,
so as to be aware of a possible degradation of the fault-tolerance capability of the
architecture. The OR gate collects the outputs of the comparators associated
with each voting stage, as well as a notiﬁcation signal from the correction
sub-system denoting whether correction actions have been performed or not.
Through the OR tree, a global notiﬁcation of malfunctioning is achieved for
each switch and notiﬁed to the global controller via a star interconnection
topology.
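The probing scheme described above amounts to a comparator per voting stage
followed by an OR reduction. A minimal behavioural sketch is given below; the
function names and the representation of voter inputs as tuples are
assumptions made for illustration.

```python
# Probing sketch: a comparator in front of each voter flags any disagreement
# among the three replicas; an OR tree merges all flags into one signal.

def replicas_disagree(a, b, c):
    """Comparator: true if the three voter inputs are not all equal."""
    return not (a == b == c)


def switch_notification(voter_inputs, correction_performed=False):
    """OR of all comparator outputs and of the correction notification.

    voter_inputs: iterable of (a, b, c) tuples, one per voting stage.
    """
    flags = [replicas_disagree(a, b, c) for (a, b, c) in voter_inputs]
    return any(flags) or correction_performed
```

Note that the notification is raised even when the voter still produces the
correct output: a single faulty replica is masked functionally, but it is
reported because the remaining redundancy has been reduced.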
3.4 Path Shifting
The Path-Shift module is composed of the following blocks:
- A demultiplexer, inserted immediately after the output of the pipeline stage.
It has two inputs (data input and select input) and two outputs
(one for head flits and the other one for payload/tail flits).
- A shifter and an encoder placed along the path followed by head flits.
- A 2x1 multiplexer.
When a new flit arrives in front of the Path-Shift module, we need to identify
the flit type, i.e., whether it is a head flit or not.
To do this, the select input port of the mux/demux is directly controlled
by the first bit of the input flit.
This bit is set to "1" for a head flit and to "0" for a payload/tail flit.
So, when the input flit is a tail or a payload flit, path shifting is bypassed.
On the contrary, when a head flit arrives, we need to shift the routing
information so that each switch can always find its target output port in
the same position. Otherwise, we would need to embed in the packet an
indication of how many hops the packet has already gone through, and the
switch would have to point to a different location in the packet head every
time. After shifting the address bits, the checkbits are no longer meaningful
and need to be recomputed by the encoder before the flit can move on.
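The path-shifting behaviour can be illustrated with a small sketch. The
head-flit layout, the 3-bit port encoding and the use of a single parity bit
in place of the actual checkbits are assumptions chosen for brevity; the
source does not specify them.

```python
# Path-shift sketch for source routing. Assumed head-flit layout: the route
# field holds one PORT_BITS-wide entry per hop, next hop in the lowest bits.
# A single parity bit stands in for the real checkbits.

PORT_BITS = 3  # assumption: enough to encode the ports of a 5x5 switch


def parity(value):
    """Even-parity bit over the route field (stand-in for the encoder)."""
    return bin(value).count("1") % 2


def path_shift(is_head, field, check):
    """Return (output_port, field, check) after the Path-Shift module.

    Payload/tail flits bypass the shifter, so they come out unchanged and
    carry no routing decision (output_port is None).
    """
    if not is_head:
        return None, field, check
    port = field & ((1 << PORT_BITS) - 1)   # entry for this switch
    shifted = field >> PORT_BITS            # next hop moves to the low bits
    return port, shifted, parity(shifted)   # checkbits are recomputed


def encode_route(ports):
    """Pack a list of per-hop output ports into a route field."""
    field = 0
    for hop, port in enumerate(ports):
        field |= port << (hop * PORT_BITS)
    return field
```

With this layout every switch reads the same bit positions, so no hop
counter is needed in the packet; the price is the re-encoding step after
each shift.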
3.5 Control Path
Arbitration is performed with a round robin arbiter with triplicated control
logic. Each instance of the arbiter is endowed with voters for self-correction;
additional voters are located on top of crossbar multiplexers for reconvergence
of the control path to the control inputs of the data path. Similarly to the
ﬁrst switch variant, a new arbiter state is saved only after voting it, to make
sure that triplicated FSMs do not get misaligned as an effect of errors. Such a
misalignment would compromise the reliability of the control path for future
transactions.
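The "vote, then save" policy for the arbiter state can be sketched as
follows. This is a behavioural illustration only: a single voter is
modelled instead of one per replica, and the state is reduced to a
round-robin pointer, both of which are simplifying assumptions.

```python
# Round-robin arbiter sketch with triplicated state. The next state is voted
# before being saved, so a corrupted replica is realigned at every update.

def majority(a, b, c):
    """Bitwise 2-out-of-3 majority voter."""
    return (a & b) | (a & c) | (b & c)


def round_robin(requests, last):
    """Grant the first requesting port after `last`, in circular order.

    requests: list of booleans, one per input port.
    Returns the granted port index, or None when nobody requests.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last + offset) % n
        if requests[port]:
            return port
    return None


def tmr_arbiter_step(requests, states):
    """One arbitration cycle over three replica pointers.

    Returns (grant, new_states). Each replica computes its own next pointer;
    the voted value is what every replica saves.
    """
    nexts = []
    for s in states:
        g = round_robin(requests, s)
        nexts.append(s if g is None else g)   # idle cycle keeps the pointer
    voted = majority(*nexts)
    grant = voted if any(requests) else None
    return grant, [voted] * 3
```

For example, with replica pointers `[1, 1, 3]` (the third one hit by an
upset), the two healthy replicas outvote the faulty one and all three leave
the cycle holding the same pointer again.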
4 Experimental Results
This section describes the experimental results of the second variant of a
5x5 switch synthesized at the target speed of 500MHz with the 40nm low-
power SVT Inﬁneon technology library. Input buﬀers are assumed to be fully
synchronous.
Figure 6.4 shows the area overhead of the proposed switch (rightmost bar)
with respect to an intermediate implementation without any testing support
and to a baseline TMR extension of the ×pipesLite switch. It can be observed
that area overhead for testing amounts to only 12.96% (Chapter 6 describes
Figure 1.5: The Second architectural variant at a glance.
the testing architecture on top of the second variant). At the same time, more
functionalities and provisions than the TMR switch are delivered at a much
lower area footprint. Figure 6.5 illustrates the area breakdown of the testing
logic. The major contribution comes from the MISRs used to perform the
diagnosis, which account for ∼8.50%. The remaining part of the
overhead is spread among the wrappers and the LFSRs used as test pattern
generators.
Last but not least, when replacing the input buﬀer with a dual clock FIFO, in
practice there is no area overhead provided we keep the number of buﬀer slots
the same. Actual buﬀer sizing then depends on network-level requirements
such as the speed ratio between sender and receiver as well as the needed
throughput across a multi-synchronous link[52].
Figure 1.6: Area overhead @500MHz.
Figure 1.7: Testing overhead and area Breakdown.
Figure 1.8: Normalized routing delay @Max Performance.
5 Conclusions
In this chapter, we presented two architectural variants of a mesh endowed with
fault-tolerance, notification infrastructure and overlapped static
reconfiguration capability. We showed the major step in design complexity
with respect to a state-of-the-art switch for low- to medium-end embedded
systems, arising from the more aggressive requirements on switch
functionality. At the same time, we showed that more functionality than TMR
can be delivered within the same area budget, but with a non-negligible
speed penalty.
Chapter 2
Synthesis Flow
1 Design ﬂow
The synchronous flow concerns the optimization of system-level hold-time
margins, with the aim of improving system robustness to timing variability.
In synchronous systems, the implementation of a global clock distribution
network has increasingly adverse effects on timing as systems scale up in
size and transistors scale down in geometry.
Smaller geometries mean higher variability, while larger gate count means
greater non-shared depth of the clock tree, between the diﬀerent system-level
regions of the chip.
Clock variability causes timing issues, as the clock skew aﬀects the timing in
two ways: setup-time wise and hold-time wise.
While setup-time issues cause the system to function at reduced performance,
hold-time violations may render a system dysfunctional. As clock variability
is a statistical eﬀect, the timing degradation relates to yield: fewer chips will
function properly.
Today, the normal way to improve hold-time margin in a circuit involves
insertion of delay buﬀers at the data path end points. This aﬀects also the
setup-time margin of the path.
In this chapter, the compress hold time prototype tool, which optimizes a
circuit for hold time, is presented. The tool includes innovative algorithms
for the insertion of hold-time buffers not only at the end points but in any
branch of the data paths of a design. An algorithm that fixes hold time
without worsening setup-time slack is also implemented (further details are
covered by confidentiality and a Non-Disclosure Agreement).
Maintaining positive setup-time slack is important in order to ensure head-
room for successful backend convergence. The tool is successfully demon-
strated on three state-of-the-art IC designs, and will be applied to NoC-based
systems. It is shown how the tool is able to ﬁx hold-time violations that are
not ﬁxable with a standard end point buﬀer insertion approach, and to ﬁx
hold-times without worsening total positive setup-time slack.
This was achieved by having the compress hold time tool automatically insert
delay buffers deep within the circuit, in non-setup-time critical branches of
the data paths. A mixed approach provides the best of both worlds, yielding
significantly better results than standard approaches.
1.1 Introduction
More than 99% of all digital ICs are implemented in a synchronous manner. A
synchronous system is characterized by having a synchronously clocked tim-
ing reference signal. In synchronous systems the implementation of a global
clock distribution network has increasing adverse eﬀects on timing, as sys-
tems scale up in gate count and transistors scale down in geometries. Smaller
geometries mean higher variability in the logic gates, while larger system size
means greater non-shared depth of the clock tree, between the diﬀerent re-
gions of the chip. Clock timing variability, deﬁned as variability in the clock
arrival time at diﬀerent clock tree sink points, is thus increasing in advanced
IC designs due to two factors.
i. Increasing on-chip variation (OCV) in clock logic gates, due to scaling
down fabrication technologies into nano-scale geometries.
ii. Increasing non-common clock tree path levels between system-level blocks,
due to design gate count scaling up into giga-scale.
Figure 2.1 illustrates this challenge. The non-common clock path is deeper
due to a larger clock tree, while the variability at each clock tree node is
increasing. As a direct result, the timing variability, between clock tree sink
points, increases. Clock timing variability causes data path timing issues, as
Figure 2.1: Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty.
the clock skew aﬀects the data path in two ways: setup-time wise and hold-
time wise. While diminishing setup-time margins will reduce design perfor-
mance, hold-time issues will render circuits dysfunctional. As clock variability
is a statistical eﬀect, the timing degradation relates to yield, as fewer chips
will function properly. Figure 2 shows how performance/timing related is-
sues constitute the fastest increasing yield degradation factor in scaling IC
technologies.
Today, the normal way to ﬁx hold-time violations in a circuit involves in-
sertion of delay buﬀers at the data path end points. This also aﬀects the
setup-time slack of the circuit. As a result, it is not always possible to ﬁx
hold-time violations without causing setup-time violations in the process.
Moreover, even a hold-time fix that does not directly cause a setup-time
violation may degrade the circuit if the positive
setup-time slack is reduced. This represents a major drawback of existing
tools. Maintaining positive setup-time slack is important in order to ensure
headroom for successful backend convergence. The tool functionality
significantly extends the state of the art in hold-time optimization, in that
it enables a major increase in flexibility and performance of the hold-time
optimization process by providing advanced new algorithms for intelligent
hold-time buffer insertion. The tool is applied to NoC-based designs in this
chapter. It is shown how hold-time violations that cannot be resolved using
end point buffer insertion can be resolved, with a reduced penalty on total
positive setup-time slack and no impact on worst positive setup-time slack.
In the prototype tool, a focus on system-level communication channels has
been implemented by allowing hold-time fixing to occur only on data paths
between system-level partitions in a design, as arbitrarily specified by the
user.
2 IC Timing
Timing is by far the most important design parameter in IC design today.
The timing of a circuit determines its performance as well as its robustness
to fabrication variability.
2.1 Timing Margins
Circuit timing revolves around two main concepts:
i. Hold-time margin
ii. Setup-time margin
Figure 2.2 illustrates these two timing concepts. In the ﬁgure, data indicates
a data arrival point, e.g. the data input of a ﬂip ﬂop or similar state-holding
element. The hold-time margin is the time after an active clock edge during
which the data value from the previous clock cycle must retain its state. This
is to ensure that the internal state of the state-holding element is completely
stable before the cell input starts to change. A hold-time violation will oc-
cur if a data path is very fast, so that new data arrives from another ﬂip
ﬂop a very short time after the clock has toggled. The setup-time margin is
the time before an active clock edge during which the data value from the
Figure 2.2: Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty.
previous clock cycle must attain its ﬁnal state. This is to ensure that the in-
ternal state of the state-holding element is completely stable before the clock
input toggles. A setup-time violation will occur if a data path is very slow,
so that new data arrives from another flip flop too long after the clock
has toggled, i.e. too shortly before the following clock event. Hold-time
and setup-time margins are both influenced by variability in the clock
timing. If the clock arrival times of the transmitting flip flop and of the
receiving flip flop are skewed relative to each other, the margins can either
be increased or diminished. Skew in one direction improves setup-time slack
while worsening hold-time slack; skew in the other direction improves
hold-time slack while worsening setup-time slack. Since the nature of
variability is that the direction of the skew is unknown, in order to
accommodate the worst-case scenario, both the setup- and hold-time margins
must be higher than the
worst-case clock skew. In order to ensure that hold-time issues do not occur,
safety margins can be added. While such margins increase the reliability and
manufacturability they also tend to limit the performance of the circuits, by
taking an increasingly conservative view of circuit timing.
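The opposite effect of clock skew on the two margins can be made concrete
with the usual single-cycle slack relations for a flip-flop to flip-flop
path. The sketch below is a textbook-style model, not taken from the tool
described in this chapter; the parameter names and the numerical values are
illustrative assumptions.

```python
# Setup and hold slack of a flip-flop to flip-flop path (single-cycle model).
# skew = capture clock arrival - launch clock arrival. All times in ns;
# the numbers used below are illustrative, not taken from the experiments.

def setup_slack(period, t_clk_q, t_path_max, t_setup, skew=0.0):
    """Data must settle t_setup before the next capturing edge."""
    return (period + skew) - (t_clk_q + t_path_max) - t_setup


def hold_slack(t_clk_q, t_path_min, t_hold, skew=0.0):
    """New data must not arrive until t_hold after the capturing edge."""
    return (t_clk_q + t_path_min) - (t_hold + skew)


if __name__ == "__main__":
    for skew in (-0.1, 0.0, 0.1):
        s = setup_slack(2.0, 0.1, 1.5, 0.05, skew)
        h = hold_slack(0.1, 0.05, 0.03, skew)
        print(f"skew={skew:+.1f} ns  setup={s:.2f} ns  hold={h:.2f} ns")
```

In this model a late capture clock (positive skew) adds to the setup slack
and subtracts the same amount from the hold slack, and vice versa, which is
why both margins must exceed the worst-case skew when its sign is unknown.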
Figure 2.3: Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty.
Figure 2.4: Increasing clock tree depth and increasing on-chip variability
causes increasing system-level timing uncertainty.
3 Fixing hold-time violations
Hold-time violations can be ﬁxed by inserting extra delay in the data path.
This eﬀectively improves the hold-time margin by slowing down the data
signals. The normal approach to ﬁx hold-time violations today is by inserting
buﬀers directly at the data path end points, as an ECO design ﬂow step.
This approach is simple and works well, if a corresponding positive setup-
time slack is available. The data path leading to the input of a ﬂip ﬂop can
be complex however, and as shown in Figure 2.3, the end point timing may
be both hold-time critical and setup-time critical. This occurs if there are
both fast and slow paths leading to the end point. While inserting buﬀers at
the data path end point slows the signal down and improves the hold-time
margin, it meanwhile borrows setup-time, worsening the setup-time margin.
This is illustrated in Figure 2.4. This is not desirable for a number of reasons.
Firstly, if the end point is setup-time critical, an end point buﬀer cannot be
inserted without causing a setup-time violation. This is not acceptable, as
setup-time determines the performance of a circuit. Secondly, even if a degree
of positive setup-time margin exists, it is not desirable to reduce the
positive setup-time slack, as this is often used to ensure a margin for
timing closure later in the design flow.
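The trade-off of end-point buffering can be reproduced with a toy model of an
end point fed by one fast and one slow path. The function name, the
zero-skew assumption and all delay values are hypothetical and serve only to
illustrate the argument above.

```python
# Effect of an end-point delay buffer on an end point fed by several paths.
# Path delays include clock-to-Q; zero skew is assumed; values are made up.

def endpoint_slacks(path_delays, period, t_setup, t_hold, buffer_delay=0.0):
    """Return (hold_slack, setup_slack) at the end point, in ns.

    A buffer at the end point delays every incoming path by buffer_delay.
    """
    fastest = min(path_delays) + buffer_delay
    slowest = max(path_delays) + buffer_delay
    return fastest - t_hold, period - slowest - t_setup


if __name__ == "__main__":
    paths = [0.02, 1.90]  # one hold-critical path, one setup-critical path
    print(endpoint_slacks(paths, 2.0, 0.05, 0.05))        # before the fix
    print(endpoint_slacks(paths, 2.0, 0.05, 0.05, 0.06))  # with the buffer
```

With these numbers the hold violation on the fast path is removed only by
turning the small positive setup slack of the slow path into a violation,
which is the situation in which end-point insertion cannot be used.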
4 Requirements for an IC timing optimization tool
Timing optimization of modern IC designs, and its integration into mainstream
IC design flows, is a complex task. A timing optimization tool must satisfy a
number of basic requirements on functionality and standard format
compliance. To import and analyse a circuit, the following functionality is
required:
i) Import of
a. Liberty cell libraries
b. SDC timing constraint commands
c. SDF cell delay information
d. Gate-level Verilog netlist
ii) Support for complex, multi-clock and clock gated/muxed architectures.
iii) Support for multiple timing modes.
Apart from the development of the actual hold-time buﬀer insertion algo-
rithms, a major part of the work performed in achieving the deliverable
described herein was focused on expanding and maturing existing FloorDi-
rector STA capabilities, integrating the hold-time buﬀer insertion algorithms
into the STA engine, and implementing support for the required netlist mod-
iﬁcation and Verilog export.
5 The Compress Hold Time Tool
A prototype tool for improving design hold-time margins, known as compress
hold time, has been developed and is embedded into Teklatech's FloorDirector
IC design framework. The tool takes advantage of FloorDirector's
built-in import and export functionality as well as of its Static Timing
Analysis (STA) engine. Insertion of hold-time buffers in arbitrary branches
of the data paths, and insertion of hold-time buffers with no effect on
setup-time slack, constitute the two key innovations in this chapter. The
tool is fully configurable, and an arbitrary level of hold-time robustness
can be achieved.
The impact on setup-time can be limited to the minimum level inherent in
the data path structure of the circuit.
5.1 Hold-time buﬀer insertion
The compress hold time prototype tool provides advanced functionality for
optimizing hold-time issues, fixing hold-time violations and improving
existing hold-time margins, moving the state of the art significantly
forward. Multi-mode timing is an integral part of the algorithms, and timing
is validated concurrently across multiple timing scenarios. The tool analyses
the netlist and timing of a design based on a parameter setup.
Hold-time is optimized by slowing down certain branches of design data
paths, by inserting buﬀers. While existing tools do this by inserting buﬀers at
the data path end points, compress hold time takes a much more advanced
approach. While simple end point buﬀer insertion is also possible using com-
press hold time, the tool implements a number of algorithms. A more ad-
vanced algorithm implemented as part of compress hold time works by iden-
tifying non-setup-time critical branches of hold-time critical data paths, and
automatically inserting hold-time buffers deep within the circuit. Not
being limited to inserting hold-time buffers at the data path end points,
compress hold time is thus able to improve hold-time slack with little or no
effect on end point setup-time slack. The algorithm is also able to fix
hold-time violations that cannot be fixed by end point buffer insertion
alone. This is demonstrated in the results section. Finally, compress hold
time is particularly useful for NoC-based designs, in which it may be
desirable to optimize hold-time margins in system-level paths more
aggressively, in order to improve robustness to system-level clock
variability. Functionality for partitioning a design exists in FloorDirector,
and compress hold time can take this partitioning into account when
optimizing: the optimization can be restricted to paths between partitions.
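The idea of inserting delay on non-setup-critical branches can be sketched
with a deliberately simple model. This is not the algorithm implemented in
the tool, whose details are confidential; each branch is assumed to be an
independent path into the same end point, zero skew is assumed, and all
names and values are hypothetical.

```python
# Branch-level hold fixing sketch. Each branch is modelled as an independent
# path into the same end point. This is NOT the tool's algorithm, only an
# illustration of the idea.

def fix_hold_on_branches(branches, period, t_setup, t_hold, hold_target=0.0):
    """Return {branch: inserted_delay} for hold-critical branches.

    branches: {name: delay in ns, clock-to-Q included}, zero skew assumed.
    The delay inserted on a branch never makes it slower than the slowest
    branch, so the worst setup slack of the end point is unchanged.
    """
    worst_setup = period - max(branches.values()) - t_setup
    inserted = {}
    for name, delay in branches.items():
        hold_slack = delay - t_hold
        if hold_slack >= hold_target:
            continue                       # branch is already hold-safe
        needed = hold_target - hold_slack
        # Room before this branch becomes as slow as the slowest one.
        room = (period - delay - t_setup) - max(worst_setup, 0.0)
        inserted[name] = min(needed, max(room, 0.0))
    return inserted


if __name__ == "__main__":
    print(fix_hold_on_branches({"fast": 0.02, "slow": 1.90}, 2.0, 0.05, 0.05))
```

In this example only the fast branch receives a buffer, so its hold
violation is removed while the slow, setup-critical branch is left untouched.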
6 Results
6.1 Hold Time Robustness: Fine-tuning the flow on a 2D mesh
This section aims at validating two hold time robustness techniques
applied on top of the same baseline NoC (a 4x4 mesh). The baseline NoC uses
the ×pipesLite switch as its basic building block. This experiment served to
fine-tune the hold-time improvement flow on the simple test case of a 2D
mesh. The first technique, named here HTR1, improves the hold time while
keeping the setup time constant. The second one (named here HTR2), on the
contrary, improves the hold time at the expense of the setup time. In the
experiments we conducted, the hold time robustness techniques have been
applied only to NoC links. The reason for this is that NoC routers
are relatively small objects, therefore it is not difficult to enforce a tight
skew constraint inside them. Conversely, interconnected switches may be far
apart, so the ultimate skew between them is far more unpredictable and the
need for hold time optimization arises. For hold time robustness, this means
that safe margins have to be enforced against possible later degradations
during the place&route step and, even later, as an effect of process
variations. We will hereafter compare the two techniques mentioned above
(HTR1 and HTR2) with the 4x4 mesh devoid of any hold time optimization.
The min-max synthesis was carried out using the following 40nm
libraries:
- ucstarlib lpsvt 12t Pslow V090 T125
- ucstarlib lpsvt 12t Pfast V121 Tm30
Once the netlist was generated by the logic synthesis tool, we applied the
HTR1/HTR2 techniques on top of the same netlist and we measured respec-
tively:
- The worst and the total hold/setup time slacks on the links (see tables 2.1
and 2.2)
- The total hold/setup time slacks on the whole design (see table 2.3)
- The area overhead (see table 2.4)
Table 2.1 contains the worst hold/setup time slack on the link for each of
the three designs mentioned above. From left to right, these designs are
the unoptimized 4x4 mesh (traditional synthesis flow) and the
HTR1- and HTR2-optimized netlists. The second line of this table contains
the worst hold time slack on the link, measured in the best case library.
As we can see on line 2 of that table, the HTR1 hold time slack is
greater than that of the 4x4 mesh (a 25% improvement); moreover, the
HTR2 hold time slack is 40% greater than the HTR1 one. The third
line of the same table contains instead the worst hold time slack on the link,
measured in the worst case library. In this latter case, the HTR1 technique
improves the hold time slack by 40% with respect to the 4x4 mesh, while the
HTR2 slack is roughly 90% greater than the HTR1 one. As regards the setup
time slacks (see line 4 of table 2.1), they are almost all
equal. More precisely, the 4x4 mesh and HTR1 have the same setup time slack,
while that of HTR2 is slightly smaller. These measurements agree with the
expected results, and hence represent a calibration of the NaNoC flow for
hold time robustness.
On the one hand, the HTR1 technique improves the hold time slack
while keeping the setup time slack constant (0.48ns vs 0.48ns). On the other
hand, the HTR2 technique improves the hold time slack by reducing the
setup time slack (a reduction of only 2%).
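The percentages quoted above follow directly from the worst link slacks
reported in Table 2.1. The short script below recomputes them; the
dictionary layout and the function name are illustrative choices, while the
slack values are those of the table.

```python
# Relative changes recomputed from the worst link slacks of Table 2.1 (ns).

TABLE_2_1 = {
    "hold_bc":  {"mesh": 0.08, "HTR1": 0.10, "HTR2": 0.14},
    "hold_wc":  {"mesh": 0.15, "HTR1": 0.21, "HTR2": 0.40},
    "setup_wc": {"mesh": 0.48, "HTR1": 0.48, "HTR2": 0.47},
}


def percent_change(new, ref):
    """Relative change of `new` with respect to `ref`, in percent."""
    return 100.0 * (new - ref) / ref


if __name__ == "__main__":
    for row, v in TABLE_2_1.items():
        htr1_vs_mesh = percent_change(v["HTR1"], v["mesh"])
        htr2_vs_htr1 = percent_change(v["HTR2"], v["HTR1"])
        print(f"{row}: HTR1 vs mesh {htr1_vs_mesh:+.1f}%, "
              f"HTR2 vs HTR1 {htr2_vs_htr1:+.1f}%")
```

Since the table values are rounded to two decimals, the recomputed
percentages can differ from the quoted ones in the last digit.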
The information contained in Table 2.2 is similar to that of Table 2.1. Here,
instead of measuring the worst slack on the link, we measured the total worst
slack over all links. In this case, the HTR1/HTR2 hold time slacks measured
in the best case library (see line 2) are respectively 61% and 62.4%
greater than those of the 4x4 mesh. When measurements are made in the
worst case library (see line 3), the same HTR1/HTR2 hold time slacks are
respectively 104.8% and 108.8% greater than those of the 4x4 mesh.
Finally, the overall setup time slack degradation of HTR2 with respect to
HTR1 is only 3.7%.
Table 2.3 contains measurements of the total worst hold/setup time slack on
the whole design. As expected, the HTR1 technique improves the overall hold
time slack (see lines 2 and 3) with respect to the 4x4 mesh while keeping the
setup time slack almost constant (see line 4). The HTR2 technique provides a
better hold time slack improvement than HTR1 (13.4% vs 13.0% in the "best
case library" and 24.6% vs 20.9% in the "worst case library") but at the
expense of setup time slack degradation (a 2.26% reduction, versus a 0.25%
variation for HTR1).
Table 2.4 shows the area cost of the 4x4 mesh and the HTR1/HTR2 techniques.
The total area (expressed in %) has been normalized with respect to that of
the 4x4 mesh. From line 5 of Table 2.4, it appears that both techniques
(HTR1/HTR2) require almost the same area overhead, 13% for HTR1 and 12% for
HTR2. Moreover, when considering the area breakdown, it appears that all
this overhead comes from combinatorial cells added to improve the hold
time (see line 2 of Table 2.4). On the other hand, the non-combinatorial area
remains constant in all three cases (see line 3 of Table 2.4).
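The area overhead and its breakdown can be re-derived from the raw cell areas of Table 2.4. The sketch below is only an illustration of that arithmetic; the dictionary keys and the function names are our own and do not belong to any tool of the flow.

```python
# Cell area in um^2, copied from Table 2.4.
AREA_UM2 = {
    "mesh": {"comb": 81937, "noncomb": 104282},
    "HTR1": {"comb": 106355, "noncomb": 104282},
    "HTR2": {"comb": 104720, "noncomb": 104282},
}


def total_area(design: str) -> int:
    """Sum of combinatorial and non-combinatorial area of a design."""
    cells = AREA_UM2[design]
    return cells["comb"] + cells["noncomb"]


def overhead_percent(design: str, ref: str = "mesh") -> float:
    """Total area overhead of `design` over `ref`, in percent."""
    return (total_area(design) - total_area(ref)) / total_area(ref) * 100.0


def overhead_breakdown(design: str, ref: str = "mesh") -> dict:
    """Split the extra area into combinatorial and non-combinatorial parts."""
    return {
        kind: AREA_UM2[design][kind] - AREA_UM2[ref][kind]
        for kind in ("comb", "noncomb")
    }


if __name__ == "__main__":
    for design in ("HTR1", "HTR2"):
        print(design, round(overhead_percent(design), 1), overhead_breakdown(design))
```

The non-combinatorial delta is zero for both techniques, which is the numerical counterpart of the statement that the whole overhead comes from the added combinatorial cells.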
Worst slack on link     4x4 mesh    HTR1      HTR2
Hold time (bc lib)      0.08ns      0.10ns    0.14ns
Hold time (wc lib)      0.15ns      0.21ns    0.40ns
Setup time (wc lib)     0.48ns      0.48ns    0.47ns
Table 2.1: Worst slack on links: unoptimized 4x4 mesh vs. HTR1 vs. HTR2.
Total worst slack on links    4x4 mesh      HTR1         HTR2
Hold time (bc lib)            899.98ns      1449.2ns     1462.34ns
Hold time (wc lib)            2078.97ns     4258.17ns    4342.15ns
Setup time (wc lib)           5931.52ns     5997.22ns    5773.8ns
Table 2.2: Total slack on links: unoptimized 4x4 mesh vs. HTR1 vs. HTR2.
Total worst slack on whole design    4x4 mesh       HTR1          HTR2
Hold time (bc lib)                   4144.77ns      4685.48ns     4699.57ns
Hold time (wc lib)                   10362.65ns     12533.35ns    12918.75ns
Setup time (wc lib)                  26021.3ns      26087ns       25431.8ns
Table 2.3: Total slack on whole design: unoptimized 4x4 mesh vs. HTR1 vs.
HTR2.
Area                       4x4 mesh      HTR1          HTR2
Combinatorial Area         81937um2      106355um2     104720um2
Non Combinatorial Area     104282um2     104282um2     104282um2
Total Area                 186219um2     210637um2     209002um2
Total Area (%)             100%          113%          112%
Table 2.4: Total Area: 4x4 mesh vs. HTR1 vs. HTR2.
2.7 Conclusion
The target was to improve the block-to-block hold-time robustness to 25% of
the clock cycle for any circuit. As shown in the results section, this was
successfully achieved. It was shown how the newly developed algorithms
achieve better results on both the total number of hold-time violated end
points resolved and the total positive setup-time slack reduction. The
compress hold time prototype tool provides full flexibility, and the level of
robustness required can be specified arbitrarily. Together with the capability
to optimize without borrowing setup-time slack, a solution to any given level
of robustness can be achieved with a minimal impact on setup-time slack,
within the limitations inherent to the structure of the data path (see
footnote 1).
1 This chapter includes content from a cooperative and interdisciplinary
work; further details are in [74].
Chapter 3
Contrasting Multi-Synchronous
MPSoC Design Styles for
Fine-Grained Clock Domain
Partitioning: the Full-HD
Video Playback Case Study
3.1 Abstract
Fine-grained (per-core) multi-synchronous systems call for new clocking
strategies and new architecture design techniques. This chapter compares
two fundamental multi-synchronous implementation variants based on the
extensive use of dual-clock FIFOs vs mesochronous synchronizers respec-
tively. The architecture-homogeneous experimental setting, the cost-eﬀective
merging of synchronizers with NoC switch buﬀers, the sharing of as many
physical synthesis steps as possible between the two architectures and the
requirements of a realistic full-HD video playback application are the key
innovations of this study.
3.2 Introduction
Pioneering research on GALS systems envisions the use of clockless interconnect
fabrics bridging synchronous islands with each other [27, 28] in a
multi-processor system-on-chip (MPSoC). A few chip demonstrators prove the
viability of this solution [24, 25, 26], yet they have not resulted in the widespread
adoption of asynchronous NoCs in the industrial arena. The reason is that
the gap between asynchronous handshaking techniques and current design
toolflows is still too large and in most cases uneconomical to bridge. As an
example, asynchronous designs require unconventional circuits such as Muller C-elements that
are usually unavailable in standard cell libraries. Moreover, asynchronous
logic is not well supported by mainstream CAD tools. Even in those cases
where physical design falls within reach of such tools, it requires extensive
manual intervention and the disabling of fundamental tool optimization features
so as not to violate specific timing constraints of asynchronous circuits [40], hence
resulting in largely unoptimized designs. In this context, the best solution
found so far for prototype design and fabrication consists of implementing
routers and GALS interfaces as hard macros using ad-hoc design styles [26].
Hard macros should be viewed in this case more as a way of working around
the lack of proper design and veriﬁcation tools rather than an aggressive
optimization technique. In fact, the area of these designs remains consistently
larger than that of fully synchronous NoC counterparts (1.8× in [26]). Regardless
of the design toolflow, it should be observed that, as the RC propagation delay
of on-chip interconnects degrades in nanoscale technologies, the handshaking
operations in asynchronous NoCs start to take a considerable amount of time,
thus significantly affecting communication performance.
The above landscape calls for a more evolutionary and cost-eﬀective solution
in the direction of a progressive relaxation of synchronization assumptions in
nanoscale MPSoCs.
Common design practice consists of implementing clock domain crossings
by means of dual-clock FIFOs. However, this solution is expensive in terms
of buﬀering resources in the FIFO itself (needed to absorb the clock speed
diﬀerence) and of FIFO crossing latency, which can be of several clock
cycles [32, 34]. This overhead is likely to worsen in the future to counter the
degradation of the resolution time constant of synchronizers [51].
Dual-clock FIFOs can be used to build up DVFS (Dynamic Voltage and Frequency
Scaling)-enabled systems with fine spatial locality by following the architectural
template in Fig.3.1, hereafter denoted as plain multi-synchronous.
They are instantiated at every switch port thus implementing clock domain
crossing for inter-switch communication. Each switch belongs to the clock
domain of its associated IP core. This solution replicates at a larger scale
the overhead of the FIFO synchronizers and introduces routing delay un-
predictability. In fact, network packets might have to cross low-speed clock
domains on their way to destination, and the spatial distribution of such per-
formance bottlenecks might change at runtime depending on the use case. On
the other hand, this architecture template removes the need for a global clock
tree, hence potentially resulting in better scalability and lower sensitivity to
technology constraints.
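The routing delay unpredictability mentioned above can be illustrated with a first-order model in which a switch forwards at most one flit per cycle of its own clock, so that the peak throughput of a route is capped by its slowest island. This is only a sketch: the 32-bit flit width, the function names and the example frequencies are hypothetical and are not taken from the platform under test.

```python
def route_bottleneck_mhz(switch_freqs_mhz):
    """Lowest clock frequency among the switches traversed by a route."""
    if not switch_freqs_mhz:
        raise ValueError("a route must traverse at least one switch")
    return min(switch_freqs_mhz)


def peak_throughput_mbytes_s(switch_freqs_mhz, flit_bytes=4):
    """Upper bound on throughput in MBytes/s.

    Assumption: one flit of `flit_bytes` bytes per cycle of the slowest
    switch on the route (hypothetical 32-bit flits by default).
    """
    return route_bottleneck_mhz(switch_freqs_mhz) * flit_bytes


if __name__ == "__main__":
    fast_route = [500, 500, 500]  # hypothetical: all islands at 500 MHz
    slow_route = [500, 100, 500]  # hypothetical: one 100 MHz island in between
    print(peak_throughput_mbytes_s(fast_route))
    print(peak_throughput_mbytes_s(slow_route))
```

With these made-up numbers, a single 100 MHz island caps the route at 400 MBytes/s even though both endpoints run at 500 MHz.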
Since network-level implications of extensively using dual-clock FIFOs for
clock domain crossings are still largely unexplored, current design practice
consists of conservatively using these components only for coarse-grained sys-
tem partitioning in order to keep the overhead aﬀordable. It is however not
clear whether the architectural template in Fig.3.1 is viable for cost-effective
and ﬁne-grained multi-synchronous MPSoCs like those in [30].
One promising synchronization technique that is fully compliant with the
multi-synchronous paradigm is mesochronous clocking. Mesochronous syn-
chronizers allow a reliable communication between synchronous blocks de-
rived from a master clock (hence sharing the same frequency) but suﬀering
from arbitrary phase oﬀset. This could be the case of a NoC inferred as an
independent clock domain, as illustrated in Fig.3.2 and hereafter denoted
as the mesochronous NoC. Given the chip-wide extension of the network
domain, clock distribution might be unbalanced and the diﬀerent switches
might receive the same clock signal but with a diﬀerent phase oﬀset. Con-
straining such oﬀset in the top-level clock tree might be either overly power
expensive or even infeasible for large nanoscale designs. Hence, mesochronous
synchronizers might be used to retime the data and transfer it reliably from
one switch to another. This architecture still requires dual-clock FIFOs to
decouple the network from the clock domains of the individual IP cores;
however, they end up being instantiated only at network boundaries, while inside
the network more cost-effective mesochronous synchronizers (area-, power-
and latency-wise) are used.
While it can be easily demonstrated that mesochronous synchronizers are less
costly than dual-clock FIFOs when considering these synchronizers in isolation,
their integration within an entire platform might call this conclusion into
question, since a number of typically overlooked effects come into play. First, the
synchronizer might have to be augmented to enforce timing margins for the
layout constraints of the design at hand (e.g., length of speciﬁc links). Second,
further complexity might be needed to implement clock gating for the case
of idleness. Third, requirements for more buﬀer slots might be posed to con-
nected NoC switches to enable full throughput operation. Fourth, while the
full-empty protocol of a dual-clock FIFO directly matches a stall/go ﬂow con-
trol protocol in the network, augmenting mesochronous synchronizers with
ﬂow control capability is not equally straightforward. Above all, the main
differentiating factor between the architectures in Fig.3.2 and Fig.3.1 is the
presence of a global and independent clock domain for the NoC.
As a result, identifying the most eﬃcient implementation of multi-synchronous
NoCs is non-trivial and requires careful consideration of the application do-
main, of the system architecture and of physical synthesis eﬀects. Conclusions
cannot be clearly drawn by assessing synchronization interfaces in isolation.
This chapter takes the network-level perspective and aims at quantifying de-
sign quality metrics of the two architectural templates with layout awareness.
For the sake of fair comparison, we implemented the two multi-synchronous
NoC variants with the same library of NoC components (the xpipes library)
and brought them through the same physical synthesis process, apart from
the few design steps that are solution-speciﬁc. Frequency settings of IP cores
and of the network (in the mesochronous case) strongly impact relative per-
formance and power ﬁgures, in addition to dictating constraints for the phys-
ical synthesis. Since these settings are tightly application-dependent, we im-
plemented an important case study for future mobile devices: full-HD video
playback. This enabled us to set the operating conditions for the frequency
islands in the NoCs and to simulate realistic communication bandwidths be-
tween the cores.
[Figure 3.1 diagram: four tiles, each a voltage and frequency island containing a core, a network interface and a switch; dual-clock FIFOs (DC_FIFOs) sit on the switch-to-switch connections.]
Figure 3.1: Plain multi-synchronous architecture based on dual-clock FIFOs.
[Figure 3.2 diagram: four voltage and frequency islands, each containing a core and a network interface, attached through DC_FIFOs to the switches of a mesochronous NoC; the switches are connected through mesochronous synchronizers.]
Figure 3.2: Mesochronous synchronization in a multi-synchronous architecture.
3.3 Related Work
Many works are focused on asynchronous interconnection networks for GALS
systems, eliminating the need for global clock distribution. The CHAIN in-
terconnect [39], the ANOC architecture [41, 42], the prototype GALS NoC in
[43], the RASP network [44] and the mesh-of-tree topology in [38] are relevant
examples thereof. Mesochronous clocking is a milder approach to the relaxation
of synchronization assumptions in MPSoCs. A common design method
for mesochronous synchronizers consists of delaying either the data or the clock
signal to sample data reliably [45, 46] and/or of using a phase detector circuit
[36].
Delay-line based synchronizers are mostly suitable for full custom designs.
A diﬀerent approach within reach of SoCs is proposed in [45, 35]: it em-
ploys cyclic write and read pointers with a certain initial spread to allow
collision-free write and read operations. [53, 34] employ similar approaches
while also synchronizing back-pressure signals. [53] investigates tight cou-
pling of mesochronous synchronizers with NoC switches. Dual-clock FIFOs
are the intuitive way of decoupling clock domains from each other, however
they incur large area, power and latency overhead, thus motivating research
eﬀorts to mitigate their cost [33, 31, 32].
The dual-clock FIFO architecture in [52] borrows the token ring solution
for FSMs from [32] and the asynchronous comparison of pointers from [31].
Above all, it is integrated inside NoC switches, serving as a multi-purpose input
buffer.
In this chapter, we borrow and/or adapt synchronizer architectures from pre-
vious work in [52, 53], where design issues are discussed for the synchronizers
in isolation. In contrast, the focus of this work is on the network level. Very
few previous works share the same abstraction level. The most relevant one
is [37], where the multi-synchronous DSPIN network is contrasted with the
fully asynchronous ASPIN solution with synthetic traﬃc. However, there is
no exploration of the design points for multi-synchronous systems.
3.4 Synchronizer Architecture
Synchronizer selection and tuning were based on the following guidelines
for the sake of fair comparison:
1. Compliance with a standard cell design ﬂow to facilitate application to
the embedded computing domain.
2. Implementation of the source synchronous link design style, where syn-
chronizers receive a regular NoC link, carrying data, ﬂow control com-
mands and a copy of the clock signal of the sender used as a strobe
signal. This style is currently the most mature one for synchronizer-
intensive designs.
3. Suitability for a tight coupling design style, where synchronizers are not
placed as external blocks in front of NoC switches, but rather integrated
with the input buﬀer of downstream switches. This way, the same buﬀer
slots of the synchronizer can be reused for switch performance buﬀering
and for ﬂow control purposes, thus coming up with a cost-eﬀective
implementation. With tight coupling, latency is signiﬁcantly reduced
as well, since additional stages are removed from the link and collapsed
into the switch input buﬀer. The reader should refer to [54] for an
overview of the beneﬁts of tightly coupled synchronizers, which we take
for granted in this work.
4. Same buﬀer switching policy: unused buﬀer slots because of network
idleness or lack of congestion should not be clocked. This choice will
then enable a fair comparison of idle power between architecture vari-
ants.
The considered dual-clock FIFO architecture is illustrated in Fig.3.3(a).
Its size is parameterizable. Based on [52], at least 5 slots are
required in order to support arbitrary frequency ratios between sender and
receiver clocks. Full and empty detectors signal fullness and emptiness
conditions of the FIFO. These detectors perform an asynchronous comparison
between the FIFO write and read pointers, which are generated in clock
domains that are asynchronous to each other.
For this reason, 2-stage brute force synchronizers are used to synchronize
deassertion of the full signal in the sender domain and of the empty signal
in the receiver domain, as shown in Fig.3.3(a). Further details can be found
in [52].
In this chapter we also consider the ﬁnding in [51] that the resolution time
constant of synchronizers keeps degrading as feature sizes shrink, therefore
more stages will need to be cascaded in brute force synchronizers in order to
achieve MTBFs (Mean Time Between Failures) of practical utility. To model
such requirements, we augment the dual-clock FIFO with 4-stage brute force
synchronizers (instead of 2) and increase the number of data FIFO slots
to 7 to preserve full-throughput operation in this case. In the experimental
results, both the 5 slot and the 7 slot FIFO variants will be considered to
account for the trend pointed out by [51].
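To see why additional synchronizer stages help, one can use the usual first-order metastability model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time available for resolution, tau is the resolution time constant and T_w is the metastability window. This model is a textbook approximation and is not taken from [51] or [52]; the numeric values below (tau, T_w, data rate) are placeholders chosen only to make the sketch runnable, and the assumption that each extra stage adds one full clock period of resolution time ignores flip-flop setup and propagation delays.

```python
import math


def mtbf_seconds(n_stages, f_clk_hz, f_data_hz, tau_s, t_w_s):
    """First-order MTBF of an n-stage brute force synchronizer."""
    if n_stages < 2:
        raise ValueError("a brute force synchronizer needs at least 2 stages")
    # Assumption: (n_stages - 1) full clock periods are available for resolution.
    t_resolve = (n_stages - 1) / f_clk_hz
    return math.exp(t_resolve / tau_s) / (t_w_s * f_clk_hz * f_data_hz)


F_CLK = 500e6   # receiver clock, matching the 500 MHz target of this chapter
F_DATA = 100e6  # placeholder rate of asynchronous events
TAU = 50e-12    # placeholder resolution time constant
T_W = 20e-12    # placeholder metastability window

if __name__ == "__main__":
    two = mtbf_seconds(2, F_CLK, F_DATA, TAU, T_W)
    four = mtbf_seconds(4, F_CLK, F_DATA, TAU, T_W)
    print(f"MTBF gain from 2 to 4 stages: {four / two:.3e}")
```

In this model the MTBF depends exponentially on t_r / tau, so a degrading (larger) tau has to be compensated by more resolution time, i.e., by more cascaded stages.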
The dual-clock FIFO can be readily used for tightly coupled NoC-synchronizer
design, since it can directly serve as a switch input buﬀer. In fact, its empty
signal can be easily conditioned (see right-hand side of Fig.3.3(a)) and changed
into the valid signal for the switch arbiter. Based on it, the FIFO is admitted
to an arbitration round. Once a connection is established between an input
and output buﬀer of the switch, FIFO transmission can be stopped via the
stall/go ﬂow control signal RX stall.
The baseline mesochronous synchronizer architecture is borrowed from [53]
and illustrated in Fig.3.3(b).
The rationale is to temporarily store incoming information in one of the front-
end registers, using the incoming clock wire to avoid any timing problem
related to the clock phase oﬀset. Once the information stored in the front-
end registers is stable, it can then be read, processed and sampled by the
target clock domain.
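The rationale above can be sketched with an idealized behavioural model. It assumes three front-end registers, as drawn in Fig.3.3(b), a write counter advanced by the sender clock, and a read counter advanced by the receiver clock that lags the write counter by a fixed spread. The function name, the spread of one receiver cycle and the ideal clock edges (no setup/hold windows, no flow control) are simplifying assumptions of this sketch, not properties of the circuit in [53].

```python
def simulate(words, phase_offset, period=1.0, n_slots=3, spread=1):
    """Replay sender writes and receiver reads in time order.

    The sender clock is shifted by `phase_offset` (0 <= offset < period)
    with respect to the receiver clock. Returns the words seen by the receiver.
    """
    if not 0.0 <= phase_offset < period:
        raise ValueError("phase offset must lie within one clock period")
    events = []
    for k, word in enumerate(words):
        # Sender edge k: the front-end counter selects slot k mod n_slots.
        events.append((k * period + phase_offset, 0, k % n_slots, word))
        # Receiver edge k + spread: the back-end counter reads the same slot.
        events.append(((k + spread) * period, 1, k % n_slots, None))
    slots = [None] * n_slots
    received = []
    # Process events by time; on a tie, writes (kind 0) come before reads.
    for _, kind, slot, word in sorted(events, key=lambda e: (e[0], e[1])):
        if kind == 0:
            slots[slot] = word
        else:
            received.append(slots[slot])
    return received


if __name__ == "__main__":
    data = list(range(10))
    for offset in (0.0, 0.25, 0.5, 0.9):
        print(offset, simulate(data, offset) == data)
```

In this model the data stream is recovered in order for any phase offset below one period. A larger spread widens the distance between the write and the read of a slot but needs enough registers: with `n_slots=1` and `spread=2` the same model returns overwritten data.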
Flow control is implemented by means of the stall/go signal, which freezes
synchronizer counters to prevent buﬀer overﬂow in the downstream block.
While this signal is already in synch with the back-end counter, it should be
synchronized with the transmitter clock before feeding the front-end counter.
A simple 1-bit synchronizer is instantiated for this purpose. This synchronizer
is replicated again in front of the upstream switch since it is demonstrated
in [53] that this solution gives rise to a larger timing margin for link delay.
The data path synchronizer can be easily coupled with the switch. For this
purpose, its output directly feeds the switch arbitration logic and its internal
crossbar. In practice, it serves as switch retiming and input buﬀer stage, in
addition to its native synchronization task. Unlike the (way more complex)
data path synchronizer, the replicated 1-bit control path synchronizer cannot
be integrated into the upstream switch, an approach which we denote as
hybrid coupling and which we follow in our implementation.
Finally, for the sake of fair comparison, we augmented the mesochronous
synchronizer of [53] with the same clock gating policy as the dual-clock FIFO.
Therefore, when there is no valid data in the synchronizer front-end, counters
are frozen and data buﬀers are not clocked. The back-end counter is stopped
after the valid signal is synchronized with the receiver clock domain by means
of another 1-bit synchronizer, as illustrated in Fig.3.3(b).
Both the dual-clock FIFO and the mesochronous synchronizer are instantiated
as input buffers in the switches of the xpipesLite NoC to implement the
architecture variants in Fig.3.2 and Fig.3.1. Unlike what the figures suggest, it
is worth recalling that there are no synchronizers placed in NoC links, since
they are all collapsed in the input buffers of downstream switches or network
interfaces (except for the 1-bit control-path synchronizer in mesochronous links).
The operating speed of the network or of its switches is now needed, but this
information is tightly application-dependent. The next section describes the use
case considered in this chapter.
3.5 Full-HD Video Playback Requirements
Extrapolating the usage scenarios of existing smart phones, one can imagine
that in some years from now, the video playback and capture capabilities
will not be limited to QVGA or WQVGA resolutions. With new mobile de-
vices having bigger display sizes, at least 1024x768 resolutions have to be
expected, but maybe even full HD-TV resolution. To provide a relevant ex-
perimental setting for the architectures under test, communication require-
ments of CPU, hardware accelerators and memory for video playback have
been scaled up to the high-end HD-TV resolution, thus addressing the most
challenging scenario foreseeable in the next few years. Extrapolated commu-
nication bandwidth requirements, based on the industrial experience of some
of the authors, are illustrated in Fig.3.4.
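As a rough plausibility check of the order of magnitude involved, the raw bandwidth of an uncompressed frame stream can be computed from the parameters given in the caption of Fig.3.4 (1920x1080 pixel, 60 frames/s, true color). The pixel sizes used below (3 bytes for packed 24-bit color, 4 bytes for 32-bit aligned storage) are our assumptions; the figure does not state how its bandwidth labels were derived.

```python
def frame_stream_mbytes_per_s(width, height, fps, bytes_per_pixel):
    """Raw bandwidth of an uncompressed video stream, in MBytes/s (1e6 bytes/s)."""
    return width * height * fps * bytes_per_pixel / 1e6


if __name__ == "__main__":
    for bpp in (3, 4):  # assumed pixel sizes, see the lead-in
        rate = frame_stream_mbytes_per_s(1920, 1080, 60, bpp)
        print(f"{bpp} bytes/pixel: {rate:.3f} MBytes/s")
```

Under the 4 bytes/pixel assumption the stream needs about 498 MBytes/s, which is of the same order as the largest edge labels in Fig.3.4.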
The operating speed of the IP cores depends on the bitwidth and on the
architecture of the core implemented in the real platform. Therefore, it was
only possible to identify a possible range of speeds for each core depending
on industrial IP core libraries and their projected future development. Table
3.1 lists possible min-max values of practical interest in the years to come.
3.5.1 System configuration
Future high-end mobile computing platforms will be most likely hierarchical.
At top level, a number of heterogeneous components will be interconnected by
a communication infrastructure, which is likely to have an irregular structure.
[Figure 3.3(a) diagram: dual-clock FIFO with token ring counters generating WRITE_POINTER (CLK_TX domain) and READ_POINTER (CLK_RX domain), full and empty detectors (TX_FULL, RX_EMPTY) followed by brute force synchronizers, DATA_IN and DATA_OUT with an output MUX, and the VALID_IN, VALID_OUT, STALL and RX_STALL signals.]
(a) Dual-clock FIFO architecture.
[Figure 3.3(b) diagram: mesochronous synchronizer with front-end registers Flop_0, Flop_1 and Flop_2 and a counter clocked by clk_sender, a back-end mux and counter clocked by clk_receiver inside the switch input buffer, and a 1-bit synchronizer on the stall signal towards the upstream switch.]
(b) Mesochronous synchronizer architecture.
Figure 3.3: Architecture of synchronizers.
One component of this system will most likely be a multi-core programmable
accelerator consisting of a regular array fabric of computation tiles. Alterna-
tively, the entire system might only consist of such a regular fabric, like the
product in [48] for embedded computing.
The focus of this chapter is therefore on a 4x4 grid of identical general purpose
processor cores onto which the requirements of Fig.3.4 and Table 3.1 are
mapped.
[Figure 3.4 diagram: communication graph among Video Decoder, DDR, Graphics Accelerator, SRAM1, SRAM2, SRAM3, Display Control, Display, CPU, DSP, DMA, SD/MMC DMA, SD/MMC Slave, SPI Slave, Audio In/Out, 2G System and 3G/LTE System; edges are annotated with bandwidths of 0.1, 0.2, 7, 8, 32, 100 and 500 MBytes/s.]
Figure 3.4: Communication bandwidth requirements for full-HD video playback:
1920x1080 pixel, 60 frames/s, true color.
At runtime, the video playback use case might be activated by mapping tasks
to cores and by conﬁguring those cores to run at the speed that optimizes
execution of their associated task.
Without loss of generality and for the sake of fair comparison, we assume
that the frequencies in Table 3.1 are the speeds of the processor cores in the
homogeneous MPSoC.
We mapped the tasks to the 16 nodes of the 4x4 2D mesh network with the
minimum-path mapping algorithm in [49], which optimizes hop delay with
priority for the most communication-hungry cores. Dimension-order routing
is assumed. An additional constraint was introduced: I/O peripheral con-
trollers (such as DDR or SPI controllers) had to be mapped on the periphery
of the chip, thus leaving central locations only for computation or storage
tasks.
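Dimension-order routing on a 2D mesh can be sketched as follows. The text only states that dimension-order routing is assumed; resolving the X dimension first (XY routing), the (x, y) coordinate convention and the function names are assumptions of this sketch.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Number of switch-to-switch hops along the dimension-order route."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(hop_count((3, 3), (0, 0)))
```

Since the route is deterministic, the set of intermediate switches (and hence of intermediate clock domains) crossed by each flow is fixed once the task mapping is chosen.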
For the pre-computed mapping, three possible frequency settings were chosen
for the cores within the bounds in Table 3.1. They are denoted in the same
table as Arch.1, Arch.2 and Arch.3.
In Arch.1 the maximum speeds were chosen for almost all cores assuming
500 MHz to be the maximum possible clock speed in the system. Arch.1 is
an aggressive scenario where high-speed IP cores are available and a lot of
pressure is put on the on-chip interconnect performance.
In order to highlight a key design concern in multi-synchronous systems,
settings were then changed as in Arch.2 to reﬂect the case where throughput-
intensive information ﬂows are exchanged between fast cores but are routed
through slow intermediate cores. Arch.2 tests robustness of architectures
under test to this unfortunate interrelation between operating speeds and
routing paths.
Finally, we artiﬁcially extended the case study to derive more general results
by selecting the same operating speed of 400 MHz for all cores (Arch.3).
This could more easily be the case in a homogeneous MPSoC with coarse-grained
power management, or running an application which requires similar
frequency settings for its cores in spite of their non-homogeneous communi-
cation requirements.
3.6 Design Flow
System architectures were at ﬁrst modeled in RTL-equivalent cycle-accurate
SystemC. OCP traﬃc generators were set to run at the speeds in Table 3.1
and programmed to inject traﬃc based on the bandwidth requirements in
Fig.3.4.
At this level, the mesochronous NoC can be modeled and simulated as a
synchronous one since, as demonstrated in [53], with the tight (or hybrid)
coupling design style latency (in clock cycles) and throughput of the networks
are the same (see footnote 1). SystemC functional simulation aims at deriving
the equivalent speed of the mesochronous NoC that enables it to match the
video playback performance of the dual-clock FIFO-based platform.
IP core                   Min (MHz)  Max (MHz)  Arch.1 (MHz)  Arch.2 (MHz)  Arch.3 (MHz)
Video Decoder             100        500        300           500           400
DDR                       266        400        400           400           400
Graphics Accelerator      100        300        300           100           400
SRAM1                     200        500        500           500           400
Display Control           100        500        300           500           400
SRAM2                     200        300        300           300           400
CPU                       300        1000       500           500           400
Display (ext. interface)  250        1000       500           300           400
DSP                       100        300        300           300           400
SRAM3                     100        300        300           300           400
Audio In/Out              10         100        100           100           400
2G System                 100        150        150           150           400
DMA                       100        200        200           200           400
3G/LTE System             100        200        200           200           400
SD/MMC DMA                100        200        200           200           400
SD/MMC Slave              100        200        200           200           400
SPI Slave                 64         128        128           128           400
Table 3.1: Range for IP core speed and chosen settings.
We exploited an industrial 65nm technology library for the physical synthe-
sis, using Synopsys Physical Compiler and Cadence SoCEncounter for logic
synthesis and for place&route respectively.
For both platforms, a hierarchical bottom-up approach was taken. With this
approach, it is possible to synthesize, place and route each switch separately
and then to assemble them together to build up the entire system at the
top level of the hierarchy. The target synthesis frequency of the switches in
both platforms under test is 500 MHz, i.e., the worst case speed every switch
should be able to support. In fact, we consider video playback as just one
use-case running on top of the network. Other use-cases might require different
speed settings for the switches of the plain multi-synchronous platform or
for the NoC of the mesochronous platform, which we assume to be upper
bounded at 500 MHz.
1 Latency in real elapsed time is slightly different, however positive and negative skews
across switch-to-switch links offset each other thus making the difference with a
synchronous NoC irrelevant. Again, this holds for a tightly coupled design style.
A hierarchical clock tree synthesis [47] was performed for the mesochronous
NoC. Local clock trees of the switches were synthesized with a tight skew
constraint of 5% of the target clock period. In contrast, the top-level clock tree
was inferred with a relaxed skew constraint of 60% of the clock period, since
timing of switch-to-switch communications is safeguarded by mesochronous
synchronizers.
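In absolute terms, and assuming that the "target clock period" is the one of the 500 MHz target synthesis frequency stated earlier in this section, the two skew constraints translate into the budgets computed below. The helper names are ours.

```python
def clock_period_ns(f_mhz):
    """Clock period in ns for a frequency given in MHz."""
    return 1e3 / f_mhz


def skew_budget_ns(f_mhz, fraction):
    """Skew budget in ns, expressed as a fraction of the clock period."""
    return clock_period_ns(f_mhz) * fraction


if __name__ == "__main__":
    print("local clock trees:", skew_budget_ns(500, 0.05), "ns")
    print("top-level clock tree:", skew_budget_ns(500, 0.60), "ns")
```

At 500 MHz the period is 2 ns, so the local trees are constrained to about 0.1 ns of skew while the top-level tree is allowed about 1.2 ns, a twelvefold relaxation.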
Both platforms under test make use of source-synchronous links. In order to
avoid any timing misalignment between the data and the source synchronous
clock wires, we used the automatic bundled routing feature of the routing
tool. This proved suﬃcient for timing closure in our case. In more challenging
scenarios, the same result can be achieved with some scripting eﬀort like in
[55].
Finally, it is worth mentioning that we used Synopsys PrimeTimePX to mea-
sure power on post-layout netlists with full annotation of switching activ-
ity from Verilog functional simulation. When measuring mesochronous NoC
power, the operating speed of the NoC was provided by the SystemC functional
simulation, so as to compare networks that provide the same
application-perceived performance.
3.7 Experimental Results
3.7.1 Area results
We ﬁrst of all comment on the area results after place&route, which are
reported in Fig.3.5 and normalized to the area of a plain multi-synchronous
platform with 7 slot dual-clock FIFOs.
For the baseline 5 slot FIFO implementations (two leftmost bars), the gap between
the mesochronous and the plain multi-synchronous NoC is not large (a
3.53% lower area), since both mesochronous synchronizers and dual-clock FIFOs
are merged with the NoC architecture, where they replace input buffers.
With a loosely coupled approach, mesochronous synchronizers would have
required an oversizing of switch input buffers for full throughput operation,
which would have offset their inherently lower number of buffer slots with
respect to dual-clock FIFOs. On the other hand, such augmented input buffers
do not dominate NoC area since other components (especially the output
buffers, sized to 6 slots in this architecture) play a major role.
[Figure 3.5 plot: bar chart of area (axis label: AREA (μm2), ticks from 0 to 1) for the Fifo5, Meso5, Fifo7 and Meso7 variants; 65nm ST Technology Library.]
Figure 3.5: Area comparison between multi-synchronous NoC implementation
variants when varying the FIFO depth.
However, even for short term chip implementations a 5 slot FIFO does
not guarantee a reasonable yield (see for instance [56]). An oversizing to 7
slots is advisable. When re-synthesizing both platforms under these assump-
tions, the two rightmost bars in Fig.3.5 point out an increased area gap: the
mesochronous NoC saves around 12% area. For larger systems (e.g., 64 cores)
this gap can only increase, since the plain multi-synchronous platform makes
larger use of dual-clock FIFOs than the mesochronous one.
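The scaling argument can be made concrete by counting synchronizers in an n x n mesh. The sketch below rests on assumptions that are ours, not the thesis': in the plain multi-synchronous template one dual-clock FIFO is counted per unidirectional switch-to-switch link, while in the mesochronous template each of those links gets one mesochronous synchronizer and each core gets `fifos_per_core` dual-clock FIFOs at the network boundary (2 by default, one per direction).

```python
def inter_switch_links(n):
    """Unidirectional switch-to-switch links in an n x n 2D mesh."""
    return 4 * n * (n - 1)


def synchronizer_count(n, template, fifos_per_core=2):
    """Count dual-clock FIFOs and mesochronous synchronizers (see assumptions)."""
    links = inter_switch_links(n)
    if template == "plain":
        return {"dc_fifo": links, "meso": 0}
    if template == "mesochronous":
        return {"dc_fifo": fifos_per_core * n * n, "meso": links}
    raise ValueError(f"unknown template: {template}")


if __name__ == "__main__":
    for n in (4, 8):
        print(n, synchronizer_count(n, "plain"), synchronizer_count(n, "mesochronous"))
```

Under these assumptions the number of dual-clock FIFOs grows as 4n(n-1) in the plain template and as 2n^2 in the mesochronous one, so the difference widens with the mesh size.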
3.7.2 Setting the speed of the mesochronous NoC
We now address the speed setting of the mesochronous NoC, which is pre-
liminary to power measurements.
By injecting the same traffic into the plain multi-synchronous and the mesochronous
NoCs based on the frequency settings of Arch.1 and the bandwidth requirements
of the video playback application, we measured the frame decoding
time from the SystemC functional simulation. Results are reported in
Fig.3.6(a) as a function of the speed of the mesochronous NoC.
[Figure 3.6 plots: frame decoding time (y-axis) versus frequency of the mesochronous NoC in MHz (x-axis) for the mesochronous and the multi-synchronous architectures. Panel (a) Arch1: 400 to 1000 MHz, y-axis ticks from 0.95 to 1.25. Panel (b) Arch2: 100 to 400 MHz, y-axis ticks from 1 to 1.9.]
Figure 3.6: Determining the speed of the mesochronous NoC for the Arch.1
and Arch.2 settings.
As we can see, execution time decreases, at ﬁrst steeply and then more grad-
ually, as the frequency of the mesochronous NoC increases, until reaching a
saturation point. The intersection of the two curves returns the smallest fre-
quency of the mesochronous NoC which allows the same performance of the
plain multi-synchronous solution. Area and power consumption are tightly
dependent on this value.
Clearly, for frequencies above 500 MHz the network is not the performance
bottleneck and such a setting would not be cost-eﬀective. In contrast, already
at 400 MHz the performance penalty is large, hence suggesting a frequency
of 500 MHz as the best choice for this case. This scenario is clearly a worst
case for the power eﬃciency of the mesochronous NoC, since in the multi-
synchronous NoC only a few islands operate at the maximum speed of 500
MHz, while the entire mesochronous NoC has to operate at such a speed.
This is an eﬀect of Arch.1 settings where processor core speeds are high
and performance critical packets end up crossing moderate- to high-speed
intermediate islands.
Diﬀerent results were obtained for the Arch.2 settings. This time, already
at 200 MHz the mesochronous NoC is able to match performance of the
plain multi-synchronous one (see Fig.3.6(b)). This is an eﬀect of throughput-
intensive packet ﬂows crossing low-speed intermediate islands, a scenario
which heavily penalizes the multi-synchronous NoC.
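The selection rule applied to both settings (take the smallest frequency of
the mesochronous NoC whose frame decoding time does not exceed that of the
plain multi-synchronous platform) can be sketched as follows. This is only
an illustration, not the procedure used to produce Fig.3.6: the function
name min_matching_frequency and the sample points are hypothetical, chosen
to mimic the qualitative shape of the Arch.1 curve.

```python
def min_matching_frequency(samples, baseline_time):
    """Return the smallest swept frequency whose decoding time does not
    exceed the baseline, or None if no swept point matches.

    samples: (frequency_mhz, decoding_time) pairs for the mesochronous
             NoC (hypothetical values, in any order).
    baseline_time: decoding time of the plain multi-synchronous platform.
    """
    matching = [freq for freq, time in samples if time <= baseline_time]
    return min(matching) if matching else None


# Hypothetical sweep: decoding time drops steeply, then saturates.
sweep = [(400, 1.20), (500, 1.00), (600, 0.99), (800, 0.98), (1000, 0.98)]
print(min_matching_frequency(sweep, baseline_time=1.00))
```

With these made-up points the rule returns 500 MHz; any higher setting also
matches the baseline but, as argued above, would not be cost-effective.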
Either Arch.1 or Arch.2 might occur in practice. Usually, core speeds for task
execution on a homogeneous computation fabric are dictated by the
application and by the need to lower power consumption of the processor cores.
Typically, idle time is exploited to lower the core speed for better power
efficiency or to temporarily switch off the core. Operating conditions for
each task are selected and coordinated by the global power management
framework. At this level, communication costs are somewhat abstracted, since
many details of the hardware communication infrastructure may not be known
in advance, including the clock distribution strategy or the final mapping.
In heterogeneous systems, where different IP cores ranging from CPUs to
hardware accelerators, I/O devices and memory macros are networked, these
speeds even depend on the library of available components. Then, mapping of
tasks onto the on-chip network is typically performed based on cost metrics
such as hop delay or power cost, while meeting constraints such as not
exceeding the maximum link capacity [49] or the placement of I/O controllers.
As a result, there is typically no explicit constraint in synthesis flows
that avoids the combination of Arch.2 settings with the mapping selected for
this work.
Even if such a constraint were introduced, it would imply either increasing
the processor core speed (thus increasing its power, which might dominate
over network power) or opting for a non-minimal route (with non-trivial
implications on communication performance and/or deadlock avoidance). This
chapter leaves the exploration of these optimizations for future work, and
searches for architecture design styles able to benefit from current design
methodologies.
[Figure 3.7 bar chart: power consumption (mW) in Arch.1 for DC_FIFO 5,
MESO 5, DC_FIFO 7 and MESO 7, during video playback and in idle.]
Figure 3.7: Power consumption of Arch.1 system configurations.
7.3 Power results
Figures 3.7, 3.8 and 3.9 report the power consumption comparison between
the mesochronous and the plain multi-synchronous platform, both in idle
and traffic conditions, for the three Arch.x operating conditions described
in Section 5.1.
Given the homogeneity of clock rates in the Arch.3 scenario, the latter is
considered first, since it highlights inherent architectural differences and
their role in determining power. As Figure 3.9 shows, this scenario
plays in favor of the mesochronous network. The plain multi-synchronous
platform results in 21% and 23% power consumption overhead, in idle
condition and during video playback respectively, with respect to the
mesochronous counterpart.
The reason lies in the more complex control logic and the higher number of
buffer slots of the dual-clock FIFOs in the plain multi-synchronous platform
with respect to the mesochronous synchronizers in the alternative
implementation variant. Interestingly, when moving from 5 to 7 slot
dual-clock FIFOs, the plain multi-synchronous platform exhibits a relevant
15mW overhead for the video playback, while the overhead for the
mesochronous NoC is marginal. This is an appealing property for future
on-chip realizations, where a dual-clock FIFO intensive design will be
severely penalized.
[Figure 3.8 bar chart: power consumption (mW) in Arch.2 for DC_FIFO 5,
MESO 5, DC_FIFO 7 and MESO 7, during video playback and in idle.]
Figure 3.8: Power consumption of Arch.2 system configurations.
[Figure 3.9 bar chart: power consumption (mW) in Arch.3 for DC_FIFO 5,
MESO 5, DC_FIFO 7 and MESO 7, during video playback and in idle.]
Figure 3.9: Power consumption of Arch.3 system configurations.
On the contrary, the Arch.1 scenario adds the runtime conﬁguration of
the network in the power balance and plays in favor of the plain multi-
synchronous network. In this case, it saves around 32% of power in idle
condition and 31% for the video playback with respect to the mesochronous
counterpart. This is mainly due to the high operating frequency (500MHz) re-
quired by the mesochronous network to match the performance of the alterna-
tive architecture. Although mesochronous synchronizers are more lightweight
than the dual-clock FIFO synchronizers, the lower average operating fre-
quency of the switches in the plain multi-synchronous platform dominates
the ﬁnal power consumption ﬁgures.
This result is however strictly dependent on the interaction between operating
speeds of the switches (or of the mesochronous NoC) and packet routing.
In fact, the Arch.2 scenario provides opposite results because of the low
operating speed of the mesochronous NoC (200MHz). This parameter is key
to determining power eﬃciency of the NoCs, thus explaining the signiﬁcant
65% power consumption overhead of the plain multi-synchronous platform
both in idle and in traﬃc condition for Arch.2. When extending the dual-
clock FIFO synchronizers to 7 slots, this extension further increases the power
gap between the two platform solutions, although to a smaller extent with
respect to Arch.3 (around 3mW). This is due to the fact that the FIFO slot
overhead is oﬀset by the low operating speeds of many clock domains. Also,
whether the traﬃc ﬂows through high-speed or low-speed intermediate hops
makes power more or less sensitive to the FIFO size. The same considerations
apply to Arch.1.
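Note that the percentages in this section are quoted against two different
references: an overhead is relative to the cheaper platform, while a saving
is relative to the more expensive one. The conversion between the two is
plain arithmetic and is sketched below. The helper names are ours; only the
65% and 32% figures are taken from the text.

```python
def overhead_to_saving(overhead):
    """If A consumes `overhead` more than B, i.e. A = B * (1 + overhead),
    return the fraction that B saves with respect to A."""
    return overhead / (1.0 + overhead)


def saving_to_overhead(saving):
    """If B saves `saving` with respect to A, i.e. B = A * (1 - saving),
    return the overhead of A with respect to B."""
    return saving / (1.0 - saving)


# Arch.2: 65% overhead of the plain multi-synchronous platform.
print(round(100 * overhead_to_saving(0.65), 1))   # 39.4
# Arch.1: 32% idle-power saving of the plain multi-synchronous platform.
print(round(100 * saving_to_overhead(0.32), 1))   # 47.1
```

Read this way, the 65% overhead of Arch.2 corresponds to a saving of about
39% for the mesochronous NoC, and the 32% saving of Arch.1 corresponds to
an overhead of about 47% for the mesochronous NoC.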
Finally, the mesochronous NoC we are considering features a maximum skew
of 60% of the clock period between any two leaves of the top-level clock
tree. When iterating the place&route with a tighter 5% skew constraint, we
measured a power consumption increment across the inferred mesochronous
NoCs ranging from 4 to 6%. This result is in agreement with [50] and denotes
a promising option when the system size scales further up.
8 Conclusions
Although a mesochronous NoC will be increasingly area eﬃcient with larger
integration densities, power eﬃciency strictly depends on the operating con-
ditions. Dual-clock FIFO based solutions are severely penalized by those map-
pings that route performance-critical packets across slow intermediate nodes.
In this scenario, the mesochronous NoC can easily do a better job. Vice versa,
when the combination of routing decisions, mapping strategy and core speed
setting is such as to put pressure on NoC performance, the mesochronous NoC
is forced to work at maximum speed to match the performance of the plain
multi-synchronous solution, thus resulting in power overhead.
Based on the performed experiments, we believe that mesochronous NoCs
have a place in multi-synchronous systems: they enable packet routing
through performance-homogeneous hops and work with traditional design
methodologies. Vice versa, dual-clock FIFO intensive architectures will
hardly be affordable for fine-grained clock domain partitioning, especially
considering the buffer over-provisioning that silicon technologies will
require to sustain yield. Nonetheless, in those use cases where only a few
cores have to run at high clock rates, they are appealing for the reduced
operating speeds of many of their clock domains. However, their successful
exploitation requires a proper upgrade of the power management and NoC
mapping strategies (and, above all, their co-optimization) to work around
slow intermediate network hops, although the performance-power trade-off in
this case is still unclear and will be the focus of our future work.
Chapter 4
Mesochronous NoC Technology
for Power-Eﬃcient GALS
MPSoCs: Mesochronous vs.
Synchronous
1 Abstract
MPSoCs are today frequently designed as the composition of multiple
voltage/frequency islands, thus calling for a GALS clocking style. In this
context, the on-chip interconnection network can be either inferred as a
single and independent clock domain or distributed among the cores' domains.
This chapter targets the former scenario, since it results in the homogeneous
speed of the NoC switching elements. From a physical design viewpoint,
however, the main issues lie in the chip-wide extension of the network
domain and in the growing uncertainties affecting nanoscale silicon
technologies. This chapter proves that, by partitioning the network into
mesochronous domains and merging synchronizers with NoC building blocks, two
main advantages can be achieved. First, it is possible to evolve synchronous
networks to mesochronous ones with marginal performance and area overhead.
Second, the mesochronous NoC exposes more degrees of freedom for power
optimization.
2 Introduction
Networks-on-chip (NoCs) are proving capable of easing the communication
bottleneck arising in multi-core computing platforms [91, 92, 93, 94], thus
overcoming the fundamental performance, power and physical design limi-
tations of shared and multi-layer busses. There is today little doubt on the
fact that a high-performance and cost-eﬀective NoC can only be designed
in 45nm and beyond under a relaxed synchronization assumption [94, 96].
One eﬀective method to address this issue is through the use of globally
asynchronous and locally synchronous (GALS) architectures, where the chip
is partitioned into multiple independent voltage and frequency domains.
Each domain is clocked synchronously while inter-domain communication
is achieved through speciﬁc interconnect techniques and circuits [95]. The
methodology of inter-domain communication is a crucial design point for
GALS architectures. One approach currently being experimented with in GALS
NoC prototypes consists of using purely asynchronous clockless handshaking
for transferring data words across clock domains [97, 98]. A few chip
demonstrators prove the viability of this solution [109, 111], but
asynchronous NoCs have not yet achieved widespread adoption in the
industrial arena. In order to evolve current industrial practice more
incrementally, some previous work in [101, 104] has developed
synchronizer-based GALS NoC technology.
In particular, design techniques merging synchronizers with network build-
ing blocks (named the tightly coupled design style) have proved area, power
and performance eﬃcient with respect to loosely coupled solutions, where
synchronizers are placed as external blocks to NoC switches. All this
previous work concerns architecture design space exploration and quality
metrics assessment of synchronizer-based communications at the switch level.
This
chapter builds on these milestones and moves a step forward by taking the
network-level perspective. While the migration from fully synchronous paral-
lel systems to GALS systems with voltage/frequency decoupling between IP
cores is taken as a matter of fact in this chapter, there are signiﬁcant GALS
NoC architecture variants the designer can still choose from. The ﬁrst one
consists of placing NoC switches in the clock domains of the IP cores they
are connected with. In contrast, an alternative solution consists of inferring
the on-chip network as an independent clock domain, disjoint from those of
the IP cores. In this scenario, dual-clock FIFOs need to be instantiated only
at the network boundary, since the network is synchronized by a single and
independent clock signal. The homogeneous performance of NoC switches,
the smaller number of dual-clock FIFOs required and the possibility to have
an always-on system interconnect fabric make this solution the more
attractive one for this chapter. However, the feasibility and efficiency of
this solution now rest mainly on the physical designer, who again has to deal
with a large synchronous clock domain (the NoC itself) distributed throughout
the entire chip. A workaround for this problem consists of inferring the
network
as a set of mesochronous domains, instead of a global synchronous domain,
yet retaining a globally synchronous perspective of the network itself. The
granularity of a mesochronous domain can be as ﬁne as a NoC switch, which
is the case considered in this chapter. The communication between neighbor-
ing switches is then mesochronous as the top-level clock tree might not be
equilibrated. This brings the additional advantage that mesochronous syn-
chronizers are typically more lightweight than dual-clock FIFOs for use in
switch-to-switch links. This chapter leverages mature mesochronous com-
munication technology to perform a comprehensive crossbenchmarking of a
mesochronous NoC with a fully synchronous NoC for use in a GALS system.
Both networks share the same baseline MPSoC-oriented NoC architecture for
the sake of fair comparison. The tightly coupled design principle is followed
for mesochronous links, so that their unique optimization opportunities in
the NoC domain are fully exploited. The chapter relies on actual implemen-
tations on a 65nm industrial technology library and provides the assessment
of a wide range of design quality metrics, some of them of special interest for
nanoscale silicon technologies: performance, area, power, and clock tree
synthesis efficiency. This way, this chapter can provide useful guidelines
for those industrial designers currently committed to the development of
next generation NoC-based MPSoCs. Concisely, the main contribution of this
chapter can be summarized as the crossbenchmarking of two GALS systems, the
former implemented with a fully synchronous NoC, the latter leveraging a
mesochronous NoC. Both systems have been compared from an area and
power viewpoint. Since dual-clock FIFOs for connection to network
interfaces are common to both solutions, they have not been considered in
this work. The remainder of this chapter is organized as follows. Section 3
will present the GALS platforms under analysis, whereas Section 4 will review
the architecture of the synchronizer block utilized as baseline to build the
mesochronous network. Section 5 will describe the method utilized to
synthesize the fully synchronous and the mesochronous GALS systems. Section
6 will present a comparison in terms of area, wiring and power overhead.
Finally, Section 7 concludes this work with a final discussion and directions
for future work.
[Figure 4.1 diagram labels: voltage and frequency islands, core, network
interface, switches, DC_FIFOs.]
Figure 4.1: Plain multi-synchronous architecture based on dual-clock FIFOs.
[Figure 4.2 diagram labels: voltage and frequency islands, core, network
interface, DC_FIFOs, switches, mesochronous NoC, mesochronous synchronizers.]
Figure 4.2: Mesochronous synchronization in a multi-synchronous
architecture.
[Figure 4.3 diagram labels: upstream switch, clk_sender, clk_receiver,
front-end, back-end, Flop_0 to Flop_2, counters, Mux, Stall, data and flow
control, 1-bit synchronizer, switch input buffer.]
Figure 4.3: Hybrid mesochronous synchronizer architecture.
3 Target GALS Architecture
A GALS-based design style fits nicely with the concept of voltage and
frequency islands (VFIs), which has been introduced to achieve fine-grain
system-level power management and is currently driving the transition of
most MPSoCs to GALS systems. In these systems, if network components belong
to the cores' VFIs, then the performance of communication flows is
determined by the slowest domain crossed on the way to the destination. Also,
in case a VFI is shut down, global connectivity is jeopardized. An alternative
solution is illustrated in Fig. 4.1, where the NoC lies in its independent
VFI. This way, performance of the whole switching fabric would be
homogeneous, with only boundary effects to take care of. Also, the network
would be loosely coupled with the cores' VFIs, and each core/cluster of cores
could be shut down without any impact on global network connectivity. The
main issue with an
independent NoC VFI consists of the feasibility of its clock tree. The reverse
scaling of interconnect delays and the growing role of process variations are
some of the root causes for this. Even though inferring a global clock tree
for the entire network will still be feasible for some time, it will probably
come at a signiﬁcant power cost. Moreover, it is unclear when the diﬃculty
of tightly and globally enforcing the skew constraint will truly become a
roadblock. However, a workaround for this problem does exist, as illustrated
in Fig.4.2. The network could be inferred as a collection of mesochronous
domains, instead of a global synchronous domain, yet retaining a globally
synchronous perspective of the network itself. There are several methods to
do this. One simple way is to go through a hierarchical clock tree synthesis
process. In practice, a local clock tree is synthesized for each mesochronous
domain, where the skew constraint is enforced to be as tight as in traditional
synchronous designs. Then, a top-level clock tree is synthesized, connecting
the leaf trees with the centralized clock source, with a very loose clock skew
constraint. This way, many repeaters and buﬀers, which are used to keep
signals in phase, can be removed, reducing power and thermal dissipation
of the top-level clock tree. The granularity of a mesochronous domain can
be as ﬁne as a NoC switch block. The communication between neighboring
switches is then mesochronous as the clock tree is not equilibrated, while the
communications between switch and IP cores are fully asynchronous because
they belong to diﬀerent clock domains. Bi-synchronous FIFOs are therefore
used to connect the network switches to the network interfaces of the cores,
as shown in Fig.4.2. This synchronization paradigm comes with additional
advantages. First, it makes a conscious use of area/power-hungry dual-clock
FIFOs, which end up being instantiated only at network boundaries. Instead,
more compact mesochronous synchronizers are used inside the network, thus
minimizing the cost for GALS technology. Finally, unlike fully asynchronous
interconnect fabrics, the synchronizer-based source-synchronous GALS ar-
chitecture illustrated in Fig.4.2 is within reach of current mainstream design
toolﬂows with just incremental eﬀort. Typically, some scripting eﬀort within
commercial synthesis frameworks enables these latter to meet the physical
requirements of source-synchronous designs [100, 108]. In the rest of this
chapter, the architectures in Fig.4.1 and Fig.4.2 will be compared from many
viewpoints by means of physical synthesis runs, to quantify when exactly to
migrate away from the architecture of Fig.4.1 and the actual overhead of
the architecture of Fig.4.2. The xpipesLite NoC architecture [99] is used as
baseline experimental setting to implement both GALS platforms. The ﬂow
control protocol used by xpipesLite is stall/go: a forward signal, synchronous
with data, ﬂags data availability (valid), while a backward signal ﬂags a
destination buﬀer full (stall) or empty (go) condition.
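The stall/go handshake described above can be illustrated with a small
cycle-based model. This is a behavioral sketch under simplifying
assumptions (zero link latency, at most one word sent and one word drained
per cycle); it is not the xpipesLite implementation, and the function name
and the drain pattern are hypothetical.

```python
from collections import deque


def simulate_stall_go(flits, slots, drain_pattern):
    """Cycle-based toy model of a stall/go link.

    flits: data words the sender wants to transmit, in order.
    slots: capacity of the destination buffer.
    drain_pattern: one boolean per cycle, True if the receiver consumes
                   a word in that cycle (hypothetical input).
    Returns (received words, number of cycles with stall asserted).
    """
    pending = deque(flits)
    buffer = deque()
    received = []
    stalled_cycles = 0
    for drain in drain_pattern:
        # Backward signal: stall while the destination buffer is full.
        stall = len(buffer) >= slots
        if stall:
            stalled_cycles += 1
        # Forward signal: valid data is sent only when not stalled.
        if pending and not stall:
            buffer.append(pending.popleft())
        # The receiver drains at most one word per cycle.
        if drain and buffer:
            received.append(buffer.popleft())
    return received, stalled_cycles


words, stalls = simulate_stall_go(
    flits=list(range(6)),
    slots=2,
    drain_pattern=[False, False, False, True, True, True, True, True, True],
)
print(words, stalls)
```

In this run the 2-slot buffer fills while the receiver is not draining, the
stall is asserted for two cycles, and all six words are still delivered in
order once the receiver resumes.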
4 Hybrid coupling of synchronizer with the
NoC
Usually, mesochronous synchronizers are just placed in front of the down-
stream switch (the loosely coupled design style). This has implications on
the size of the switch input buﬀer as well, which should cover the round
trip latency to sustain maximum throughput. Given the large buﬀering and
latency overhead of this approach, we proposed in [103] to bring the syn-
chronizer deeper into the downstream switch, as illustrated in Fig.4.3. The
reference synchronizer architecture receives as its inputs a bundle of NoC
wires representing a regular NoC link, carrying data and/or ﬂow control
commands, and a copy of the clock signal of the sender used as a strobe
signal for them. The circuit is composed of a front-end and a back-end.
The front-end is driven by the incoming clock signal, and strobes the incom-
ing data and ﬂow control wires onto a set of parallel latches in a rotating
fashion, based on a counter. The back-end of the circuit leverages the lo-
cal clock, and samples data from one of the latches in the front-end thanks
to multiplexing logic which is also based on a counter. The rationale is to
temporarily store incoming information in one of the front-end latches, using
the incoming clock wire to avoid any timing problem related to the clock
phase oﬀset. Once the information stored in the latch is stable, it can be
read, processed and sampled by the target clock domain. In the architecture
of Fig.4.3, the synchronizer output now directly feeds the switch arbitration
logic and its internal crossbar, thus materializing the tight coupling concept
of the mesochronous synchronizer with the switch architecture. The ulti-
mate consequence is that the mesochronous synchronizer becomes the actual
switch input stage, with its latching banks serving both for performance-
oriented buﬀering and synchronization. A side beneﬁt is that the latency of
the synchronization stage in front of the switch is removed, since now the syn-
chronizer and the switch input buﬀer coincide. The buﬀering overhead in the
switch input buﬀer because of ﬂow control is also removed accordingly. The
main change required for the correct operation of the new architecture is to
bring the stall/go ﬂow control signal to the front-end and back-end counters
of the synchronizer, in order to freeze their operation in case of a stall. While
this signal is already in synch with the back-end counter, it should be syn-
chronized with the transmitter clock before feeding the front-end counter.
The backward propagating stall/go is then directly synchronized with the
transmitter clock available in the front-end by means of a similar but smaller
(1-bit) synchronizer. For this architecture solution, only 3 latching banks are
needed in the synchronizer front-end, since link latency has been minimized.
In practice, this is only 1 slot more than the input buffer in the fully
synchronous switch. A loosely coupled approach would require a 4-slot input
buffer and a 3-latch-bank synchronizer. As regards the control path, a 1-bit
synchronizer
is replicated in front of the upstream switch. This synchronizer is not merged
with the downstream buﬀer, since this would give rise to overly tight timing
constraints [101]. In contrast, integrating only the data-path synchronizer is
denoted as the hybrid coupling and gives more guarantees for timing closure,
and is the approach taken hereafter.
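The rotating use of the latch banks described in this section can be
sketched behaviorally as follows. The model only captures the two counters
and the effect of a stall on them; it does not model clock phase offset,
metastability or timing, and the class and method names are hypothetical.

```python
class RotatingLatchSynchronizer:
    """Toy model: the front-end writes and the back-end reads, each
    addressing the latch banks through its own modulo counter."""

    def __init__(self, banks=3):
        self.latches = [None] * banks
        self.write_counter = 0   # advanced by the sender clock
        self.read_counter = 0    # advanced by the receiver clock

    def sender_clock_edge(self, word, stall=False):
        """Front-end: strobe the incoming word into the next bank.
        A stall freezes the counter and the word is not stored."""
        if stall:
            return
        self.latches[self.write_counter] = word
        self.write_counter = (self.write_counter + 1) % len(self.latches)

    def receiver_clock_edge(self, stall=False):
        """Back-end: sample the bank selected by the read counter.
        A stall freezes the counter and nothing is sampled."""
        if stall:
            return None
        word = self.latches[self.read_counter]
        self.read_counter = (self.read_counter + 1) % len(self.latches)
        return word


sync = RotatingLatchSynchronizer(banks=3)
out = []
sync.sender_clock_edge("A")
sync.sender_clock_edge("B")
out.append(sync.receiver_clock_edge())
sync.sender_clock_edge("C")
out.append(sync.receiver_clock_edge(stall=True))  # counters frozen
out.append(sync.receiver_clock_edge())
out.append(sync.receiver_clock_edge())
print(out)
```

As long as the reader never lags the writer by more than the number of
banks, the words come out in the order in which they were strobed in, and a
stalled cycle simply returns nothing without losing data.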
5 Synthesis of GALS Platforms
Both the synchronous and the mesochronous platforms have been designed
to be seamlessly integrated into an industrial design ﬂow using commercial
tools for physical synthesis. Only standard cells are used and no full custom
components. The reference topology of our experiment is a 4x4 mesh net-
work where each switch is connected to either a core or a memory (of size
1.5mm). As far as the physical synthesis is concerned, the same bottom-up
methodology has been utilized for both platforms. Speciﬁcally, each network
switch has been placed and routed in isolation with a target frequency of
500MHz. The clock tree of each switch has been synthesized with a tight
skew constraint of 5% of the target clock period. Once the local clock tree is
characterized with its input delay, skew and input capacitance, a macromodel
is built in order to be used in the next design step. Furthermore, in order to
implement a hierarchical clock tree synthesis, a buffer has been inserted at
the input clock pin of each switch block. Once the switches have been placed
and routed, they are imported as macro blocks in the main network design
along with their libraries detailing both timing and physical characteristics.
The next step consists of performing a top-level clock tree synthesis by
leveraging the switch macromodels previously extracted. In fact, this model
can be used to characterize the bottom clock tree, given that these local
clock trees will not be modified by the place&route tool. Therefore, in order
to preserve the clock tree local to the switches, a "PreservePin" tag must be
used in the CTS specification file. Note that the hierarchical CTS
has been used both for the synchronous and the mesochronous platforms,
since this is a standard methodology for parallel hardware platforms. The
only difference is the skew constraint in the top-level clock tree, which can
be loosened for the mesochronous design while it should be tightly enforced
for the synchronous one. The final step of our hierarchical methodology
consists of routing the switch-to-switch links and performing parasitics
extraction for accurate static timing analysis and power estimation. Timing
closure for both the synchronous and mesochronous NoC has been achieved at
500 MHz by performing exactly the same physical synthesis steps.
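For reference, the local skew constraint quoted above translates into an
absolute budget as follows. The helper name is ours; the 500 MHz target and
the 5% fraction are the values stated in this section.

```python
def skew_budget_ps(frequency_mhz, skew_fraction):
    """Maximum allowed skew in picoseconds for a clock of the given
    frequency, with the skew expressed as a fraction of the period."""
    period_ps = 1e6 / frequency_mhz   # a 1 MHz clock has a 1e6 ps period
    return skew_fraction * period_ps


# 500 MHz target frequency, skew limited to 5% of the clock period.
print(round(skew_budget_ps(500, 0.05), 1))
```

At 500 MHz the clock period is 2 ns, so the tight 5% constraint on the
local clock trees corresponds to a skew budget of 100 ps.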
Figure 4.4: Power consumption with no activity and with uniform random
traﬃc (normalized with respect to the mesochronous network).
Figure 4.5: Area and wiring intricacy (normalized with respect to the
mesochronous network).
6 Experimental results
Area and Wiring Overhead
Figure 4.5 reports post-place&route area and wiring statistics for the archi-
tectures under analysis. From an area viewpoint, both systems exhibit the
same footprint. In more detail, our baseline architecture (i.e., the fully
synchronous mesh) features 2-slot input and 6-slot output buffers. On the
other hand, its mesochronous counterpart has a 3-slot input buffer and
exactly the same amount of output buffering. Nonetheless, the area is
identical. This is due to the fact that synchronization mechanisms, tightly
coupled in the input buffer, are implemented through latch banks, which
typically require a smaller area footprint compared to the flip-flops adopted
in the input buffer of the baseline architecture. The ultimate result is an
equal area occupation in both platforms, although this comes with a somewhat
more challenging testing framework. From the wiring point of view, a 23%
net saving is achieved by the fully synchronous platform. The reason lies in
the fact that the mesochronous platform features an additional clock wire per
output port, utilized as strobe signal for data synchronization, and a further
external single-bit synchronizer for backward flow control synchronization,
instantiated in each of the 48 switch-to-switch channels of the network. Last
but not least, the slightly more complex network topology contributes to a
more complex structure of the clock tree.
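The figure of 48 switch-to-switch channels follows directly from the 4x4
mesh topology when each pair of neighboring switches is connected by one
channel per direction. A short check is given below; the function name is
ours, and the 8x8 case is only a hypothetical larger mesh, not a
configuration evaluated in this chapter.

```python
def mesh_switch_to_switch_channels(rows, cols):
    """Number of unidirectional switch-to-switch channels in a
    rows x cols 2D mesh, with one channel per direction between
    each pair of neighboring switches."""
    horizontal_pairs = rows * (cols - 1)
    vertical_pairs = cols * (rows - 1)
    return 2 * (horizontal_pairs + vertical_pairs)


print(mesh_switch_to_switch_channels(4, 4))   # 48
print(mesh_switch_to_switch_channels(8, 8))   # 224
```

Each of these channels carries one extra strobe wire and one external 1-bit
synchronizer in the mesochronous platform, which is why the wiring overhead
grows with the number of channels.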
Power analysis
By leveraging post-layout netlists and back-annotated switching activity, we
were able to achieve very accurate power figures. In fact, cycle-accurate
simulations have been carried out with uniform random traffic as well as with
all the network switches in idle conditions. A value-change-dump (VCD) file
has been dumped from the simulations and then utilized to carry out a very
accurate power estimation with Synopsys PrimeTime PX. Figure 4.4 reports
power consumption of both fully synchronous and mesochronous networks. Idle
power plays in favor of the fully synchronous network. This is mainly due to
the additional switch-to-switch clock wire used as strobe signal for data
synchronization. This result calls for further evolution of mesochronous NoC
technology, to implement a form of clock gating on these lines. On the other
hand, when stimulating the networks with a uniform random traffic pattern,
the mesochronous Network-on-Chip exhibits a smaller power consumption with
respect to the fully synchronous one. The reason lies in the inherent
architectural difference between the input buffer of the mesochronous switch
and of the synchronous one. In the latter, both flip-flop banks are triggered
at each clock cycle. Conversely, latch banks of the mesochronous input buffer
are triggered by an enable signal driven by a counter. Since the counter
logic enables only a single latch bank at a time, the ultimate register power
consumption of the mesochronous input buffer is smaller than that of its
synchronous counterpart. With a mesochronous NoC, an interesting opportunity
pointed out by [105, 107] is to exploit hierarchical clock tree synthesis to
reduce power of the top-level clock tree. The tuning knob to materialize
power savings is the relaxation of the skew constraint, so that fewer buffers
are instantiated in the top-level tree. We experimented with this on the 4x4
mesochronous mesh by incrementally relaxing the skew constraint. Given the
relatively small system size, we constrained the top-level tree to be placed
and routed outside the IP core area, which captures the challenging
requirements of many real-life MPSoC designs.
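The input buffer argument in the power analysis above can be made concrete
with a first-order count of how many register banks are clocked or enabled
per cycle. This is an activity count only, not a power estimate: the energy
per event of a latch bank and of a flip-flop bank differs, and the function
names are hypothetical.

```python
def flipflop_bank_triggers(cycles, banks=2):
    """Synchronous input buffer: every flip-flop bank is triggered
    at each clock cycle."""
    return [cycles] * banks


def latch_bank_enables(cycles, banks=3):
    """Mesochronous input buffer: a counter enables a single latch
    bank per cycle, in a rotating fashion."""
    counts = [0] * banks
    counter = 0
    for _ in range(cycles):
        counts[counter] += 1
        counter = (counter + 1) % banks
    return counts


print(sum(flipflop_bank_triggers(12)), sum(latch_bank_enables(12)))
```

Over 12 cycles the 2-slot flip-flop buffer sees 24 bank triggers, while the
3-bank latch buffer sees 12 enables, even though it has one more slot.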
7 Conclusions and Discussion
Evolution of MPSoCs to GALS systems is an ongoing process, driven by
the immediate need to decouple voltage and frequency of IP cores from each
other for power management. However, when a GALS NoC is implemented as
an independent clock domain with dual-clock FIFOs at the boundaries, then
physical designers have again to deal with a global chip wide clock tree (i.e.,
the one of the network itself). By capitalizing on mature mesochronous tech-
nology, this chapter compares a mesochronous NoC and a fully synchronous
NoC (both for use in a GALS system) in a systematic way. The lessons learned
from this experimental work can be summarized as follows:
1. A fully synchronous NoC can be evolved to a mesochronous NoC with no
area or latency overhead, thanks to the hybrid coupling design style.
2. During network activity, the mesochronous NoC proves more power-efficient
because of the inherent clock gating implemented at its tightly coupled syn-
chronizers in the switch input buﬀer. In contrast, a 20% higher standby power
is incurred because of the transmitted and continuously switching clock sig-
nals in source synchronous links. A clock gating technique applied to the
switch input buﬀer of the synchronous NoC and to the source synchronous
links of the mesochronous NoC may align power results of the two solutions.
In any case, the mesochronous NoC does not feature any signiﬁcant overhead
from a power viewpoint.
3. Hierarchical clock tree synthesis in a mesochronous NoC can potentially
reduce total power of the top level clock tree at the cost of progressively
loosening skew constraints in the same tree. However, power savings achieved
in this way are not signiﬁcant yet, for a number of concurrent reasons. First
of all, there is a gap between the required maximum skew and the obtained
one, since the CTS tool has been conceived for minimizing skew, and not
for increasing it. Therefore, to take full advantage of this eﬀect, CTS tools
should be customized accordingly.
4. On the other hand, when tight skew constraints are required under chal-
lenging physical and timing constraints, the CTS tool is not able to meet the
target. In practice, this means that with current CAD tools it will become
rapidly impossible to enforce tight skew constraints in the top level clock tree.
Under these operating conditions, it is important to have an underlying ar-
chitecture with inherent skew robustness. In our experiments, mesochronous
NoCs prove capable of meeting this requirement.1
1 This chapter includes content from a cooperative and interdisciplinary work;
further details are in [50].

Chapter 5
Testing Architecture on Top of
the First Variant of the Mesh
1 Abstract
Digital design convergence, together with the new usage models of mobile
devices, is raising a clear need for new requirements such as flexible
partitioning, runtime adaptivity and reliability. In turn, such feature-rich
architectures make the testing challenge more severe. The above trend has
direct implications on the design of the underlying on-chip network, which
becomes not only the system integration framework, but also the control
framework executing hypervisor commands, or reacting to runtime operating
conditions. The ultimate challenge for the NoC is to co-design these features
together, while taking advantage of cross-feature optimization opportunities.
This chapter takes on this challenge and illustrates the design experience of
a NoC switch architecture serving as the key enabler for the next generation
of reliable and reconﬁgurable systems.
2 Introduction
Network-on-Chip (NoC) design principles have recently reached a stage
where they start to stabilize [78, 79]. Unfortunately, the requirements on the
design of an embedded system as a whole are far from stabilizing. More
in detail, there is today an unmistakable trend toward the implementation
of heterogeneous architectures in mobile systems, combining a host CPU
with a many-core programmable accelerator[80], or embedded GPGPU[81].
The provision of such acceleration by means of an array fabric of homoge-
neous processor cores yields unprecedented trade-oﬀs between ﬂexibility and
energy eﬃciency. As a result, eﬃcient exploitation of on-chip accelerators
might go through time-multiplexing or time-interleaving of concurrent appli-
cations, which would incur large penalties in terms of serialization or context
switches, respectively. It is therefore no wonder that partitioning is emerging
as the fundamental paradigm for operation of many-core programmable ac-
celerators, in order to pursue the integration of functionality from separate
users/virtual machines onto NoC-based many-core architectures. A static
partitioning scheme cannot keep up with the increased levels of adaptivity
of modern embedded systems, therefore ﬂexible partitioning should be the
target, that is, partitions can be arbitrarily set up or torn down at runtime,
and their shape may vary over time. A key enabler for ﬂexible partitioning
is represented by the ability to rapidly, safely and properly conﬁgure the
routing mechanism of the underlying on-chip interconnection network, and
to dynamically reconﬁgure it at runtime to expose the maximum level of
flexibility to users. However, flexible partitioning and runtime
reconfiguration are not the only requirements that foster the next generation of on-chip
networks. Today, due to the increased variability of components and breadth
of operating environments, reliability becomes relevant to mainstream ap-
plications. Implementation of fault-tolerant communication over the on-chip
network is still a challenging task in spite of the extensive record of research
works in the ﬁeld[86], due to the potentially large overhead that may af-
fect the resulting architectures. In addition, a key property that novel NoCs
cannot miss is to guarantee a potentially fast path to industry, since NoC
deployment is today a reality. An important requirement for this purpose
is the eﬃcient testability of candidate NoC architectures. This property is
very challenging due to the distributed nature of NoCs and to the diﬃcult
controllability and observability of its internal components. When we also
consider the pin count limitations of current chips, we derive that NoCs will
be most probably tested in the future via built-in self-testing (BIST) strate-
gies. Although there is still ample room to research novel approaches to the
speciﬁc challenges NoC architectures have to cope with, one key concern is
to co-design the solutions to diﬀerent challenges in an integrated framework.
In practice, interdependencies between different NoC features should be
detected ahead of time, so as to avoid engineering highly optimized solutions
for specific problems that nevertheless coexist inefficiently in the final
switch architecture. This chapter takes on the challenge of engineering a NoC
switch for the next generation of feature-rich, highly reconﬁgurable and re-
liable systems, with thorough cross-feature optimizations. The novel switch
design point features the following key properties:
• The routing function is reconfigurable at runtime, so as to enable advanced
network management policies such as network partitioning and
isolation, virtualization, selective power down of speciﬁc regions; the
network does not need to be drained for the sake of deadlock-free re-
conﬁguration.
• Reliability protection is guaranteed through fault-tolerant ﬂow control
and through an error detection, notiﬁcation and recovery infrastruc-
ture, with speciﬁc solutions for diﬀerent kinds of faults. Single event
upsets are detected and ﬁxed on-the-ﬂy, while intermittent faults are
detected and addressed via routing path reconﬁguration to avoid them.
• A BIST framework is set up targeting 97% coverage of single stuck-
at faults of the entire feature-rich NoC switch design. The framework
is conceived not only for post-silicon testing, but also for boot-time
testing of the device.
The distinctive contributions of this chapter are the following:
• We come up with a feature-rich NoC switch that makes an important
breakthrough in state-of-the-art NoC architectures;
• We show the interdependencies and smooth integration requirements
between the reconﬁguration framework, the fault-tolerance mechanism
and the testing strategy in the proposed NoC switch.
• We quantify the steep increase in complexity that the new requirements
demand of the NoC switch architecture.
• We identify which NoC feature impacts switch area and critical path
the most through analysis of incremental variants of the switch archi-
tecture.
3 Baseline Architecture
The switch architecture proposed in this work is a major extension of the
baseline ×pipesLite switch [82], which targets the embedded computing do-
main with a very lightweight architecture.
The considered ×pipesLite variant implements logic-based distributed rout-
ing (LBDR): each switch has simple combinational logic that computes target
output ports from packet destinations and local switch coordinates. By means
of 26 conﬁguration bits for each switch (indicating switch port connectivity,
routing restrictions, and deroutes), the routing function can be reconﬁgured
at runtime[84]. The straightforward yet overly expensive way to make the
baseline switch fault-tolerant is through Triple Modular Redundancy. The
only advantage is that the TMR architecture can aﬀord keeping the native
STALL/GO ﬂow control unmodiﬁed. The baseline ×pipesLite architecture,
as well as its TMR extension, will be used as reference design points in the
experimental results.
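
As an illustration of the logic-based routing described above, the following Python sketch models a basic LBDR decision for a 2D mesh. It is a simplified behavioural model and not the RTL of the switch: it uses only 4 connectivity bits and 8 routing-restriction bits, and it omits the deroute bits that belong to the 26-bit configuration. The coordinate convention (y grows southward), the names `lbdr_ports`, `conn` and `turn`, and the XY-style configuration are assumptions made for this example.

```python
# Behavioural sketch of basic LBDR (no deroutes); all names are illustrative.
# Assumed convention: x grows eastward, y grows southward.

# For each output port, the two perpendicular directions a packet could
# still need to take at the next switch.
PERPENDICULAR = {"N": ("E", "W"), "S": ("E", "W"),
                 "E": ("N", "S"), "W": ("N", "S")}


def lbdr_ports(cur, dst, conn, turn):
    """Return the set of admissible output ports at switch `cur`.

    conn[p]     -- connectivity bit: port p is attached to a usable link
    turn[p + q] -- routing bit: a packet leaving through p may turn to q
                   at the next switch
    """
    (cx, cy), (dx, dy) = cur, dst
    if cur == dst:
        return {"L"}  # deliver to the local port

    # First stage: quadrant of the destination relative to this switch.
    want = {"N": dy < cy, "S": dy > cy, "E": dx > cx, "W": dx < cx}

    # Second stage: filter by routing restrictions and connectivity.
    ports = set()
    for p, (q1, q2) in PERPENDICULAR.items():
        if not want[p]:
            continue
        same_row_or_col = not want[q1] and not want[q2]
        turn_allowed = (want[q1] and turn[p + q1]) or (want[q2] and turn[p + q2])
        if conn[p] and (same_row_or_col or turn_allowed):
            ports.add(p)
    return ports


# Example configuration emulating XY routing: turns from the X dimension
# into Y are allowed, turns from Y back into X are forbidden.
XY_TURNS = {"EN": True, "ES": True, "WN": True, "WS": True,
            "NE": False, "NW": False, "SE": False, "SW": False}
ALL_CONNECTED = {"N": True, "E": True, "S": True, "W": True}
```

Under these assumptions, a packet at switch (1, 1) headed to (3, 0) is offered only the east port, because the north-then-east turn is restricted. Rewriting the `turn` and `conn` bits at runtime is what changes the routing function without touching the logic.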
4 Basic Design Choices for the New Switch
The proposed switch architecture is designed to be the basic building block
of a reconﬁgurable and fault-tolerant NoC. Reconﬁguration is achieved by
means of a global controller implemented in software, which requires com-
mand execution support in hardware. A dual network is therefore designed to
exchange control information between switches and the global controller. Re-
conﬁgurability is implemented as run time modiﬁcation of the routing func-
tion, in order to provide not only ﬂexible network partitioning when several
applications are concurrently executed, but also to avoid faulty links/regions
of the network. This latter functionality requires that points of failure are ﬁrst
detected, both at boot and run time, then notiﬁed to the system manager,
that triggers the reconﬁguration accordingly. The basic design techniques
to deliver the target functionality (boot-time testing, fault tolerance, control
signalling, and routing function reconfiguration) are hereafter described,
while their interdependencies are captured in section 5.
Fault-Tolerance
Whether a fault-tolerance switching strategy should aﬀect the ﬂow control
protocol or not is a major design choice with high impact on the overall switch
architecture. The work in[87] derives error recovery strategies for the same
NoC switch both from a ﬂow control protocol with error notiﬁcation capabil-
ity (NACK/GO) and from another one lacking this support (STALL/GO).
Data retransmission is used in the former case, while the latter one can only
rely on error correction. It has been shown that NACK/GO potentially re-
sults in shorter critical path, more conservative area and lower peak power, at
the cost of a slight average power overhead. This led us to opt for NACK/GO
for the proposed switch architecture (Figure 5.1). The proposed solution targets
single event upsets (SEUs). In the data path, detectors trigger flit retransmissions
from the sender buffer; each retransmission is preceded by correction of the
stored flit in case it was corrupted in the buffer. On the control path, FSMs
are triplicated to avoid their permanent misalignment, while routing and ar-
bitration logic is just doubled, since dual-rail checkers (DRCs) can trigger
retransmissions from the input buﬀer upon mismatch detection.
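
The data-path behaviour described above can be sketched as a small behavioural model. This is not cycle-accurate and does not reproduce the actual switch: a single even-parity bit stands in for the real error detector, flits are sent one at a time, and the correction of the flit stored in the sender buffer is omitted. The function names and the channel model are hypothetical.

```python
# Behavioural sketch of NACK-triggered retransmission; not cycle-accurate.
# Assumption: a single even-parity bit stands in for the real error detector.


def add_parity(data):
    """Append an even-parity bit as the least significant bit."""
    return (data << 1) | (bin(data).count("1") & 1)


def parity_ok(flit):
    """Detector: a valid flit has an even number of ones."""
    return bin(flit).count("1") % 2 == 0


def send(flits, link, max_attempts=4):
    """Send each flit over `link`, resending it while the receiver NACKs.

    `link(flit, attempt)` models the channel and returns the word seen by
    the receiver. The sender keeps the flit buffered until it is ACKed.
    """
    delivered, nacks = [], 0
    for data in flits:
        buffered = add_parity(data)
        for attempt in range(max_attempts):
            received = link(buffered, attempt)
            if parity_ok(received):      # receiver answers ACK
                delivered.append(received >> 1)
                break
            nacks += 1                   # receiver answers NACK: resend
        else:
            raise RuntimeError("flit not delivered: fault is not transient")
    return delivered, nacks


def upset_on_first_try(flit, attempt):
    """Channel with one transient upset: flips bit 3 on the first attempt."""
    return flit ^ 0b1000 if attempt == 0 else flit
```

In this model a transient upset costs one extra transfer per affected flit, whereas a fault that persists across all attempts is reported instead of being silently accepted.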
Notiﬁcation Interface
In this work, we opt for a centralized approach to network control: a global
manager is in charge of network reconﬁguration decisions as an eﬀect of fault-
tolerance, power management or virtualization strategies. In order to address
the need for control signaling between network nodes and the global manager,
we resort to the dual communication infrastructure proposed in [89], where
the main NoC is extended with a ring which connects all the switches of the
main NoC together. The ring implementation implies the extension of each
switch with a simple routing primitive, which is an oversimpliﬁed version of
an input buﬀered switch.
Reconﬁgurability
The reconﬁguration mechanism of the routing function in the presence of
background traﬃc should provide deadlock freedom during the transition
from one routing algorithm to another, when extra dependencies may arise
and lead to deadlock. To cope with this issue, our switch leverages Overlapped
Static Reconﬁgurations (OSR), a technique which avoids draining network
traﬃc[88]. OSR was ﬁrst proposed for oﬀ-chip networks, and its customiza-
tion for a much more resource-constrained on-chip setting, named OSR-Lite,
has been performed in[83]. The basic principle is the following: if packets
with the old routing function are guaranteed to never go behind packets us-
ing the new routing function, then no deadlock cycles can occur. In OSR this
is achieved by triggering a token that separates old packets from new ones.
The token advances through the network hop by hop, following the chan-
nel dependency graph of the old routing function, and progressively drains
the network from old packets, allowing new packets to enter the network at
routers where the token already passed.
Boot-Time Testing Architecture
This work implements a BIST strategy for the new switch, targeting single
stuck-at faults. All network switches are tested and diagnosed in parallel,
thus cutting down on test application time and making it independent of
network size. We envision a single 39-bit LFSR per switch that feeds
pseudo-random test patterns to every switch port in parallel. Test responses of a
switch sub-block are fed to connected sub-blocks, serving as test patterns for
them, as much as possible, depending on whether aggregate coverage remains
reasonable or not. This approach minimizes test pattern generator (TPG) and
test wrapper overhead. Link testing is performed in a cooperative way: test
patterns are injected by the upstream switch while diagnosis is performed
by the downstream one. The BIST architecture is depicted in Fig. 5.2. Test
responses are collected by Multiple Input Signature Registers (MISRs). One MISR
performs signature analysis of test responses from the link, the input buﬀer
and the LBDR block together with the associated OSR-Lite logic. Another
one collects responses from a crossbar mux, an output buﬀer, a port arbiter
and the associated OSR-Lite logic. The testing framework is able to reveal
the correct position of faults inside the switch, since a MISR is dedicated to
each input and output port.
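
The interplay between pattern generation and signature analysis can be illustrated with a short Python sketch. It is only a software model under stated assumptions: the feedback polynomials, the 16-bit MISR width, the seed values and the names `Lfsr`, `signature` and `port_passes` are chosen for the example and are not taken from the actual design.

```python
# Sketch of LFSR pattern generation and MISR signature compaction.
# Polynomials, widths and seeds are assumptions chosen for illustration.


class Lfsr:
    """Galois-style shift register of `width` bits with feedback mask `poly`."""

    def __init__(self, width, poly, seed=1):
        self.mask = (1 << width) - 1
        self.msb = 1 << (width - 1)
        self.poly = poly & self.mask
        self.state = seed & self.mask

    def step(self, data_in=0):
        """Shift once; `data_in` is XORed in parallel (used by the MISR)."""
        carry = self.state & self.msb
        self.state = (self.state << 1) & self.mask
        if carry:
            self.state ^= self.poly
        self.state ^= data_in & self.mask
        return self.state


def signature(responses, width=16, poly=0x1021):
    """Compact a sequence of test responses into one MISR signature."""
    misr = Lfsr(width, poly, seed=0)
    for word in responses:
        misr.step(word)
    return misr.state


def port_passes(responses, golden):
    """One pass/fail diagnosis bit for a switch port."""
    return signature(responses) == golden


# Test pattern generator: a 39-bit LFSR with x^39 + x^4 + 1 as an
# example feedback polynomial.
tpg = Lfsr(width=39, poly=(1 << 4) | 1, seed=1)
patterns = [tpg.step() for _ in range(64)]
```

A port is declared faulty when the signature of its observed responses differs from the golden signature obtained from a fault-free simulation. Since only the final signature is compared, one register per port is enough to yield a per-port diagnosis bit.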
5 Cross-Feature Optimizations
Supporting fault-tolerance, reconﬁguration, notiﬁcation and testing in the
same switch architecture gives rise to relevant design inter-dependencies,
summarized in Fig. 5.3. They are either opportunities to exploit or requirements
to enforce to ensure correctness, as described hereafter.
A. The support for dynamic reconﬁguration of the routing function sug-
gests an eﬃcient approach to the detection of and recovery from wear-out
faults. These latter typically exhibit a progressive onset, consisting of fre-
quent transient faults aﬀecting the same circuit (intermittent faults) [90].
The NACK/GO infrastructure can detect transient faults and notify them
(and their location) to the global manager. The global manager can thus
monitor the frequency and the location of these events and eventually trig-
ger a reconﬁguration to exclude the aﬀected circuits before they become
permanently damaged.
B. The OSR-Lite mechanism for NACK/GO switches should accurately define
when a switch port can migrate to the new routing algorithm. We activate
the new algorithm once the token has been received and the last packet of the old
routing function has left the port and has been acknowledged (ACK) by the
receiver side.
C. Fault-tolerance support in the switch has a double eﬀect on testing eﬀec-
tiveness. On one hand it ends up reducing the observability/controllability
of signals and hence the testing coverage. Voters and correctors in fact mask
errors of associated modules. Moreover, error detectors natively prevent prop-
agation of test responses across the data path upon error detection. Hence,
they cannot serve as test patterns for downstream modules, thus limiting
fault observability as well as downstream controllability. Therefore, we iden-
tiﬁed some workarounds to preserve testing coverage. In order to improve ob-
servability, voter inputs are brought to diagnosis logic that detects whether
there are discrepancies or not. To increase controllability of the data path,
we feed it with fully random test patterns from the LFSR, which are with
high probability incorrectly encoded (we use BCH code for data protection).
Test wrappers prevent the detector from driving the buﬀer FSM at each cy-
cle, thus enabling from time to time the test pattern to go through the data
path. Another wrapper randomly enables the corrector from time to time.
When this happens, test patterns for the data path are not provided by the
LFSR, but by the corrector itself. On the other hand, fault tolerance has also
some beneﬁts on testing, that we exploited. In fact, replicated logic through
TMR and DMR provides a built-in support for diagnosis by enabling the
comparison of replicated outputs. Similarly, the error notiﬁcation output of
the decoder was brought directly to a MISR for diagnosis.
D. Having a testing strategy enables the fault-tolerance framework to take
the absence of permanent faults for granted. Having our testing strategy iter-
ated at every system bootstrap increases the conﬁdence level of this assump-
tion. Therefore, at every usage session, the user is informed about whether
the fault-tolerance logic has its full recovery capability from transient faults
at runtime, or whether it is degraded by permanent faults to some extent.
E. Having a notification infrastructure in place, and a global manager
for control tasks, implies that switches do not need to notify each other
of the outcome of testing and diagnosis. This avoids the implementation of
complex notiﬁcation protocols between the switches. Instead, after testing
completes at every system bootstrap, per-switch diagnosis bits are notiﬁed to
the global manager which processes them, gets global visibility, and programs
the routing function of the switches accordingly.
G. In the absence of a dual control network, overprovisioning of the main
NoC would be needed to support the reconﬁguration process. For instance,
in[88] a dedicated virtual channel is instantiated for this purpose. Vice versa,
we use a VC-less main network and a simpler dual network to convey key
reconﬁguration information.
F-H. The reconﬁguration and the testing support dictate the packet format
on the dual control network. In particular, packets are composed of 2 or 3
ﬂits. The ﬁrst ﬂit always contains information about the delivery time of
the packet (for fault statistics purposes), the type of information (i.e., BIST
testing result, reconﬁguration bits or transient fault notiﬁcation), and either
the source or the destination switch address. When BIST testing information
is delivered to the global manager, the second and the third flit
contain the diagnosis result and its negated copy, respectively, for information
redundancy in time. In contrast, reconfiguration packets sent back to switches carry new
LBDR programming bits in both the second and the third flit. Finally, online
fault notification comes with only two flits, with the second flit reporting
the input-output port combination that triggered a transient fault.
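
The packet layouts of point F-H can be sketched as follows. Field widths are not given in the text, so the 8-bit payload, the field names and the helper functions are hypothetical, and the negated copy is assumed to be a bitwise negation. Reconfiguration packets are left out, because the exact split of the LBDR bits across the second and third flit is not specified here.

```python
# Sketch of two control-packet layouts on the dual network.
# Field names and the 8-bit payload width are assumptions.
from dataclasses import dataclass

WIDTH = 8                      # hypothetical payload width in bits
MASK = (1 << WIDTH) - 1

BIST_RESULT, TRANSIENT_FAULT = "bist", "transient"


@dataclass(frozen=True)
class Header:
    timestamp: int             # delivery time, kept for fault statistics
    kind: str                  # type of information carried
    address: int               # source or destination switch


def bist_packet(timestamp, switch, diagnosis):
    """3 flits: header, diagnosis bits, bitwise negation of the diagnosis."""
    diagnosis &= MASK
    return [Header(timestamp, BIST_RESULT, switch), diagnosis, ~diagnosis & MASK]


def transient_packet(timestamp, switch, in_port, out_port):
    """2 flits: header, input-output port pair that observed the fault."""
    return [Header(timestamp, TRANSIENT_FAULT, switch), (in_port, out_port)]


def check_bist(packet):
    """Return the diagnosis bits, or None if the two copies disagree."""
    _, diagnosis, negated = packet
    return diagnosis if diagnosis ^ negated == MASK else None
```

Sending the diagnosis twice in complementary form lets the global manager reject a packet in which either copy was corrupted on the way, at the cost of one extra flit.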
I. As mentioned in point G, the ﬁrst system conﬁguration at bootstrap fol-
lows the BIST testing procedure, and does not take place in the presence of
background traﬃc. Therefore, we avoid token propagation of OSR-Lite at this
stage, and switches are programmed as they get the new routing information
through the dual network.
J. It is useless to localize faults inside modules that should be entirely dis-
abled regardless of the internal fault position. An LBDR-aware algorithm,
executed by the global manager, searching for routes on a given topology,
needs only information about the faulty links to generate a compatible rout-
ing function. As a consequence, we group switch logic blocks based on the
switch input or output port they belong to, and treat port failures as the fail-
ure of attached links. Thus, we end up having one diagnosis bit (pass/fail)
per switch port/link.
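
A minimal sketch of how a global manager might turn the per-port diagnosis bits of point J into the list of links to avoid is given below. The mesh coordinates, the port names and the function `faulty_links` are assumptions; the route search that consumes the result is not shown.

```python
# Sketch: turn per-port pass/fail bits into a set of links to avoid.
# Mesh coordinates (y grows southward) and port naming are assumptions.

STEP = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}


def faulty_links(diagnosis):
    """diagnosis maps (x, y) -> {port: passed}. Returns undirected links.

    A failed input or output port is treated as a failure of the link
    attached to it, so one failing end is enough to discard the link.
    """
    links = set()
    for (x, y), ports in diagnosis.items():
        for port, passed in ports.items():
            if not passed:
                dx, dy = STEP[port]
                links.add(frozenset({(x, y), (x + dx, y + dy)}))
    return links
```

Because a link is stored as an unordered pair of switches, a failure reported by both ends of the same link counts once.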
K. As explained in point A, the switch is able to detect transient errors.
However, since we are targeting also intermittent faults, we notify the oc-
currence of transient faults to the global controller through ad-hoc control
packets, which also carry fault localization information.
L. The dual control network is a single point of failure, since the correct
conﬁguration of the system depends on its capability to safely deliver in-
formation from switches to the controller and vice versa. For this purpose,
we took the approach in [10] and combined fault-tolerance design techniques
with online testing strategies on the dual control network, at the cost of
some extra communication latency on it. If the global manager detects that
communication with a switch is not fully reliable, the switch is discarded,
the topology is modiﬁed and a new routing function computed. We architect
the dual network under conservative assumptions, so that the probability of
a deviation between wanted and actual routing configuration is marginal.
Figure 5.1: NACK/GO switch architecture.
6 Experimental Results
All the logic synthesis runs performed in this work have been carried out by
means of a low-power standard-Vth 40nm Inﬁneon technology library.
Complexity Breakdown: Area Results
The following experiment points out both the complexity gap between the
native xpipesLite switch and the feature-rich extended one, and the area
increment that each integrated switch feature contributes. Normalized area
results are shown in Figure 5.4, where features are incrementally added to the
baseline switch. The baseline and its TMR extension are reported as reference design
points. Fault-tolerance is clearly the highest-impact feature.
Figure 5.2: BIST-enhanced switch architecture.
A non-negligible area contribution in fact comes from the detector and corrector
modules, which account for almost 13% of the total area. When the reconfiguration
mechanism is integrated into the NACK/GO switch, an 11% area overhead is
introduced. The notification system (TMR-protected dual network) proves
lightweight (5% area overhead), since it takes advantage of the diagnosis
logic already made available for fault-tolerance purposes. Finally, the switch
capable of built-in self-test and self-diagnosis brings a 27% area overhead,
which is the second major source of complexity after fault-tolerance. The
area penalty mainly comes from MISRs and test-wrappers. Despite the use
of pseudo-random test patterns, which typically save area with respect to de-
terministic ones in BIST approaches[85], the control logic to be tested is so
complex that test wrappers need to penetrate deeper into the switch architec-
ture to improve controllability and observability. However, when we consider
a baseline TMR switch which implements only fault-tolerance on top of a
baseline xpipesLite switch, we can see that the proposed switch (rightmost
bar in the plot) provides many more features at comparable area footprint.
Figure 5.3: Interdependency Diagram between Reconfiguration, Fault-tolerance,
Testing and Notification.
Complexity Breakdown: Delay Results
In order to evaluate the eﬀects of each additional feature on the switch propa-
gation delay, we performed a 5x5 switch synthesis for maximum performance
for all 5 incremental solutions under test. Results are reported in Fig. 5.5.
The fault-tolerant NACK/GO switch, the switch with the OSR-Lite mechanism
and the switch with the notification system achieved similar maximum operating
speeds. Finally, the testing framework degraded the maximum performance
of the NACK/GO switch by 13%, since the performance of the switch is limited
by the test wrappers placed on the critical path. The TMR
solution is around 30% slower than the baseline switch, while the proposed
switch delivers far more functionality at the cost of a longer critical
path (+13%).
Coverage for single stuck-at faults
Table 5.1 reports the total number of cells associated with each tested module
and the coverage achieved by testing. The latter was derived
by means of an in-house gate-level fault simulation framework. Worse
results are obtained for the OSR-Lite mechanism (~95%) and especially for
the LBDR (~91%), as a direct consequence of the complexity of their FSMs and
combinational logic, respectively. Concerning the testing latency, a network
composed of the proposed switches, as assumed so far, would take 10,000 clock
cycles for testing, regardless of the network size.

Switch sub-block   Cells   Coverage
OSR-Lite             360     95.0%
Arbiter              334     96.7%
Crossbar             154     97.4%
Input Buffer        1272     97.6%
Output Buffer       1855     97.4%
LBDR                 289     91.0%
TOT                 4264     96.8%
Table 5.1: Coverage for single stuck-at faults.
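
The TOT row of Table 5.1 can be cross-checked from the per-block rows. The short computation below assumes that the total coverage is the cell-weighted mean of the block coverages, which holds only if the number of faults scales with the number of cells; the text does not state how the total was obtained.

```python
# Cross-check of the TOT row of Table 5.1 from the per-block rows.
# Assumption: total coverage is the cell-weighted mean of block coverages.
blocks = {
    "OSR-Lite": (360, 95.0),
    "Arbiter": (334, 96.7),
    "Crossbar": (154, 97.4),
    "Input Buffer": (1272, 97.6),
    "Output Buffer": (1855, 97.4),
    "LBDR": (289, 91.0),
}

total_cells = sum(cells for cells, _ in blocks.values())
weighted_coverage = sum(cells * cov for cells, cov in blocks.values()) / total_cells
```

Under that assumption the rows are consistent with the reported totals: the cell counts add up to 4264 and the weighted mean rounds to 96.8%. The two buffers dominate the total, which is why the lower coverage of the LBDR and OSR-Lite blocks moves it only slightly.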
7 Conclusions
In this chapter, we proposed a fully testable switch architecture endowed with
fault-tolerance, notification infrastructure and overlapped static reconfiguration
capability. We showed the major step in design complexity with respect
to a state-of-the-art switch for low- to medium-end embedded systems, arising
from the more aggressive requirements on switch functionality. At the same
time, we showed that more functionality than TMR can be delivered within
the same area budget, but with a non-negligible speed penalty. Overall, this
chapter identified the interdependencies between the different design features
and addressed them all in a coherently integrated final switch microarchitecture.
Figure 5.4: Area analysis.
Figure 5.5: Routing delay analysis.
Chapter 6
Testing Architecture on Top of
the Second Variant of the Mesh
1 Switch Architecture
A parameterized n × m (n: number of input ports, m: number of output
ports) source based routing 2-stage switch has been designed for the applica-
tion speciﬁc computing domain, augmented with fault-tolerance provisions.
The scheme of the switch architecture is depicted in ﬁgure 6.2 and is com-
posed of the following main blocks:
- A fault tolerant Input buﬀer of two slots with triplicated control logic and
endowed with voters.
- A fault tolerant Output buﬀer of six slots with the same characteristics of
the input buﬀer.
- A fault tolerant Arbiter, triplicated and endowed with voters.
- A Path-Shift module and a Crossbar.
- Some comparators are placed in speciﬁc places for runtime diagnosis and
to notify the global manager.
The switch model was generated to be parameterizable to meet the require-
ments of application-speciﬁc topologies. Parameters that can be set include
the number of input and output ports and the width of the data portion of
the ﬂit. The width of the checkbit portion is derived accordingly.
We will now describe the behaviour of each block of the switch, then we will
present the testing architecture with experimental results.
1.1 Tightly Coupled Dc FIFO
Synchronization interfaces, such as dual-clock FIFOs, are typically instanti-
ated as external blocks with respect to the module they are connected with.
This "loose coupling" of synchronizers with respect to NoC components implies
several drawbacks. First, the FIFO module introduces additional communication
latency in the intercommunication link. As a result, provisions
must be normally made since the ﬂow control signal may arrive multiple
clock cycles after the destination module decides to halt the source module.
The problem can be addressed by reserving space in the destination buﬀer,
thus incurring a signiﬁcant area and power overhead, or by enhancing the
dual-clock FIFO with ﬂow control capability.
In the EU-funded GALAXY project, the aforementioned problem was tackled
by merging the dual-clock FIFO with the switch input buﬀer, thus coming
up with a unique architecture block in charge of buﬀering, synchronization
and ﬂow control, and sharing buﬀering resources for all of these tasks. The
GALAXY project has also shown that this design principle, which we denote
as "tight coupling" of the synchronizer with the NoC, can be applied to dual-clock
FIFOs in a straightforward way. For this reason, the switch can optionally
replace its input buﬀer for a fully synchronous environment with a dual-clock
FIFO for a multi-synchronous environment, as illustrated in Figure 6.1. In
all cases, functional correctness is guaranteed.
A similar functionality can be easily implemented also in the switch provided
the dc FIFO is extended with the NACK/GO ﬂow-control protocol.
1.2 Input/Output Buﬀers
Input and output buffers are much simpler than in the first switch variant,
since they do not have to handle the NACK/GO flow control protocol but
rather the simpler STALL/GO one. The input buffer is sized with two slots,
which is the minimum amount of resources needed not to lose data during
stall activation (three slots were needed with NACK/GO). The output buffer can be arbitrarily
sized for performance buﬀering. As previously mentioned, control logic of
input and output buﬀers is triplicated for fault tolerance and endowed with
voters. An additional voter is placed on top of the data-path registers with
the purpose of voting the outputs from the three instances of the buffer
control logic. The voted output drives the read and write pointers of FIFO
data registers.
Figure 6.1: Dual-Clock FIFO integration into one input port of the switch
architecture.
1.3 Probing System
At the same time, probes inserted in front of each voter sniﬀ their inputs
and inform (through a comparator and an OR gate) the global manager
about possible malfunctioning of each of the replicated branches. We ﬁnd it
important that the manager can keep this kind of information under control,
so to be aware of a possible degradation of the fault-tolerance capability of the
architecture. The OR gate collects the outputs of the comparators associated
with each voting stage, as well as a notiﬁcation signal from the correction
sub-system denoting whether correction actions have been performed or not.
Through the OR tree, a global notiﬁcation of malfunctioning is achieved for
each switch and notiﬁed to the global controller via a star interconnection
topology. In fact, we do not need a fine-grain diagnosis as in the first variant:
if a switch component starts to fail repeatedly, we do not envision any
reconfiguration course of action in the system, nor would one be possible in the
application-specific hardware platform at hand.
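
The voting and probing scheme described in this section can be summarized with a small functional sketch. It models only the logic functions (majority vote, comparator, OR tree), not their timing or placement; the function names are hypothetical.

```python
# Functional sketch of a voter, its probe and the per-switch OR tree.
# Names are illustrative; signals are modelled as plain integers.


def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote over three replicated outputs."""
    return (a & b) | (a & c) | (b & c)


def probe(a, b, c):
    """Comparator in front of a voter: True if the replicas disagree."""
    return not (a == b == c)


def switch_alarm(voter_inputs, correction_performed):
    """OR tree: one notification bit per switch for the global manager."""
    return any(probe(*triple) for triple in voter_inputs) or correction_performed
```

The voter masks a single faulty replica, so without the probe the fault would go unnoticed. The alarm bit does not say which replica or which voting stage failed, which matches the coarse-grain diagnosis chosen for this variant.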
1.4 Error correction
Error correction was the preferred fault-tolerance strategy given its
capability to stretch device lifetime as much as possible in the presence of
permanent faults. The Hsiao code was used to implement the corrector. However,
unlike in the first switch, an encoder needs to be integrated, since with
source-based routing the head flit needs to be changed. Therefore, parity
check bits for the new flit need to be computed. The flit width depends on
the data width and on the number of parity checkbits. For data widths of 32,
48, 64 and 128 bits, the number of checkbits is 7, 7, 8 and 9 respectively,
giving compound flit widths of 39, 55, 72 and 137 bits.
These checkbits are computed by means of a Hsiao encoder and are appended
to the data word before injection into the crossbar, as illustrated in
Figure 6.2.
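The checkbit counts quoted above follow from a simple counting argument. A
Hsiao parity-check matrix uses distinct odd-weight columns, and there are
2^(r-1) odd-weight vectors of r bits, so r checkbits can protect k data bits
only if 2^(r-1) >= k + r. The sketch below (the function name is ours)
evaluates this bound for the four data widths.

```python
def secded_check_bits(data_width: int) -> int:
    """Smallest r such that 2**(r-1) >= data_width + r (odd-weight-column bound)."""
    r = 2
    while 2 ** (r - 1) < data_width + r:
        r += 1
    return r

for k in (32, 48, 64, 128):
    r = secded_check_bits(k)
    print(f"data width {k:3d}: {r} checkbits, flit width {k + r}")
```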
A corrector module named "Corr", located after the data registers of the
input buffers, checks that the received checkbits match the ones computed
for the data in transit. If the checkbits differ, the data is corrupted and
needs to be fixed. The corrector is able to detect double-bit errors and to
correct single-bit errors. It is also endowed with a one-bit output that
informs the global manager each time the computed checkbits do not match.
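As an illustration of how such a corrector behaves, the following sketch
builds a small odd-weight-column code for 32 data bits and 7 checkbits and
decodes by syndrome. It is a simplified model under our own assumptions: the
column selection does not balance row weights as a real Hsiao matrix would,
and all names are hypothetical rather than taken from the "Corr" module.

```python
from itertools import combinations
from typing import List, Tuple

def hsiao_columns(k: int, r: int) -> List[int]:
    """Pick k distinct r-bit columns of odd weight >= 3 (weight 1 is for checkbits)."""
    cols = []
    for weight in range(3, r + 1, 2):
        for bits in combinations(range(r), weight):
            cols.append(sum(1 << b for b in bits))
            if len(cols) == k:
                return cols
    raise ValueError("not enough odd-weight columns for this data width")

def check_bits(data: int, cols: List[int]) -> int:
    """XOR together the columns selected by the set data bits."""
    c = 0
    for i, col in enumerate(cols):
        if (data >> i) & 1:
            c ^= col
    return c

def correct(data: int, received_check: int, cols: List[int]) -> Tuple[int, str]:
    """Syndrome decoding: fix one error, flag two."""
    syndrome = check_bits(data, cols) ^ received_check
    if syndrome == 0:
        return data, "clean"
    if bin(syndrome).count("1") % 2 == 0:
        return data, "double error detected"      # even weight: two errors
    if syndrome in cols:
        return data ^ (1 << cols.index(syndrome)), "data bit corrected"
    if bin(syndrome).count("1") == 1:
        return data, "check bit corrected"        # the error hit a checkbit
    return data, "uncorrectable"

COLS = hsiao_columns(32, 7)
word = 0xDEADBEEF
chk = check_bits(word, COLS)
```

Any status other than "clean" corresponds to the one-bit notification sent
to the global manager.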
After the corrector module, a pipeline stage breaks the switch critical path
lengthened by the additional error correction logic, trading critical path
delay for latency. The pipelined organization of the switch will make it
possible in the future to replace the current corrector with more powerful
ones with no or marginal impact on clock speed. In the presence of switch
pipelining, integration with flow control is critical in order not to lose
data and not to incur throughput penalties. The switch implements a custom
solution for this purpose. A stall signal arriving from the output ports
through the port arbiters is brought both to the input buffer and to the
single-slot pipeline registers, where it serves as the enable signal. This
way, we avoid using two-slot buffers as pipeline registers, as typically
required when cascading stall/go retiming and flow control stages. In fact,
the transmission can be frozen directly in the upstream input buffer, while
preserving the capability of the pipeline register to store one flit. At the
same time, there is no performance penalty, since flow resumption is
immediate and, when the input buffers have no data to push, the stall signal
from the output buffers is not activated (i.e., the pipeline register is
never prevented from being used).
At the output of the pipeline stage, ﬂits are routed across two main paths:
- Towards the Path-Shift Module
- Towards the Arbiter Module
1.5 Path Shifting
The Path-Shift module is composed of the following blocks:
- A demultiplexer, immediately inserted after the output of the pipeline stage.
It is composed of two inputs (data input and select input) and two outputs
(one for head ﬂits and the other one for payload/tail ﬂits).
- A Shifter and an Encoder placed along the path followed by head ﬂits.
- A 2x1 multiplexer
When a new flit arrives at the Path-Shift module, its type must be
identified, i.e., whether it is a head flit or not. To this end, the select
input of the mux/demux is directly controlled by the first bit of the input
flit, which is set to "1" for a head flit and to "0" for a payload/tail flit.
When the input flit is a payload or a tail flit, path shifting is bypassed.
Conversely, when a head flit arrives, the routing information is shifted so
that each switch always finds its target output port in the same position.
The alternative would be to embed in the packet an indication of how many
hops it has already gone through, forcing each switch to point to a different
location in the packet head every time. After the address bits are shifted,
the checkbits are no longer valid and need to be recomputed by the encoder
before the flit can move on.
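The path-shifting step can be sketched as follows. The flit layout is an
assumption made only for illustration: a 32-bit data word whose most
significant bit is the head flag, with 3-bit output-port selectors packed
from the least significant bits upwards. The encoder is a one-bit parity
placeholder standing in for the Hsiao encoder.

```python
from typing import Callable, Optional, Tuple

DATA_WIDTH = 32                      # assumed data width
PORT_BITS = 3                        # assumed selector size (enough for 5 ports)
HEAD_FLAG = 1 << (DATA_WIDTH - 1)    # assumed position of the flit-type bit
ROUTE_MASK = HEAD_FLAG - 1

def parity_stub(data: int) -> int:
    """Placeholder for the Hsiao encoder: a single even-parity bit."""
    return bin(data).count("1") & 1

def path_shift(data: int, check: int,
               encode: Callable[[int], int] = parity_stub
               ) -> Tuple[int, int, Optional[int]]:
    """Return (data, checkbits, output_port) after the Path-Shift module."""
    if not data & HEAD_FLAG:
        return data, check, None                 # payload/tail flit: bypass
    route = data & ROUTE_MASK
    port = route & ((1 << PORT_BITS) - 1)        # selector for this switch
    shifted = HEAD_FLAG | (route >> PORT_BITS)   # next hop moves into place
    return shifted, encode(shifted), port        # checkbits are recomputed

# A head flit routed through output ports 2, 4 and 1 of three switches.
flit = HEAD_FLAG | 2 | (4 << 3) | (1 << 6)
check = parity_stub(flit)
ports = []
for _ in range(3):
    flit, check, port = path_shift(flit, check)
    ports.append(port)
```

Because every switch reads the same bit positions, no hop counter needs to be
carried in the packet, at the price of re-encoding the head flit at each hop.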
Figure 6.2: The Second architectural Variant of the Mesh at a glance.
1.6 Control Path
Arbitration is performed by a round-robin arbiter with triplicated control
logic. Each instance of the arbiter is endowed with voters for self-correction;
additional voters are located on top of crossbar multiplexers for reconvergence
of the control path to the control inputs of the data path. Similarly to the
ﬁrst variant switch, a new arbiter state is saved only after voting it, to make
sure that triplicated FSMs do not get misaligned as an eﬀect of errors. This
would compromise reliability of the control path for future transactions.
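The rule of saving the arbiter state only after voting can be illustrated
with a small behavioural model. It assumes that the arbiter state is the
index of the last granted port and that the three replicas are voted with a
bitwise majority; both are our modelling choices, not details of the actual
FSM encoding.

```python
from typing import List, Optional

def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)

def next_grant(requests: List[bool], last: int) -> Optional[int]:
    """Round robin: first requesting port after the last granted one."""
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last + offset) % n
        if requests[port]:
            return port
    return None

def tmr_arbiter_step(state_regs: List[int], requests: List[bool]) -> Optional[int]:
    """Each replica proposes a next state; only the voted value is saved."""
    proposals = []
    for last in state_regs:
        grant = next_grant(requests, last)
        proposals.append(last if grant is None else grant)
    voted = majority(*proposals)
    state_regs[:] = [voted] * 3          # all replicas store the voted state
    return voted if any(requests) else None

regs = [2, 2, 2]
regs[1] = 3                              # upset in one replica's state register
grant = tmr_arbiter_step(regs, [True, False, False, True, False])
```

After the step, the three state registers are aligned again, which is the
property the text relies on for future transactions.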
2 Testing Methodology
The main challenge in testing an application-specific system lies in the
highly heterogeneous nature of the system itself, where every core (or cluster
of cores) might be operated at a different speed. For this reason, the use of a
dual-clock FIFO between IP cores becomes mandatory. However, core speeds
are not typically ﬁxed, but they may vary within a range. After consulting
Figure 6.3: Testing Architecture. In green, the test wrapper is pointed out.
with the industrial partners, we agreed on the feasibility of bringing all the
cores/clusters to the same speed for the sake of testing. This can be achieved
with frequency regulators or with a dedicated low-speed compact PLL for
system testing. Under these assumptions, the system can be tested under
fully synchronous timing, as will be done throughout this chapter.
Alternatively, the NaNoC design platform provided in deliverable D1.4
a methodology to test multi-synchronous links through an asynchronous
handshake for the exchange of test patterns between frequency domains.
Due to the requirement to limit the area of the embedded testing logic, and
given the more relaxed approach to testing latency optimization, we decided
to adopt a testing approach based on pseudo-random test patterns. The
approach and the philosophy are very similar to the ones used for the first
variant switch, except for a few architecture-specific customizations.
Diagnosis is again performed with Multiple Input Signature Registers (MISRs).
The overall testing architecture is depicted in Figure 6.3.
As we can see, each switch is endowed with one worst-case LFSR of 39 bits.
Through a test wrapper, the LFSR feeds all the crossbar multiplexer ("MUX")
inputs with a 6-bit shift between any two adjacent inputs. This is needed to
get a satisfactory coverage for the multiplexer.
Figure 6.4: Area overhead @500MHz.
Meanwhile, the same LFSR drives the select inputs of each multiplexer of the
crossbar with a 1-bit shift between consecutive multiplexers, again to
stimulate all cells and improve the coverage.
At the same time, the LFSR drives the stall input and the valid input of
each input/output buffer, and the local ID flags, busy in and valid in of each
arbiter.
The data inputs of each arbiter are driven by test vectors from the upstream
switch, again implementing the principle of cooperative and parallel testing
between NoC switches.
A MISR placed after the arbiter and before the crossbar multiplexer of the
downstream switch performs the diagnosis.
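The pattern generation and signature analysis just described can be sketched
behaviourally. The feedback taps (39, 35) are an assumption taken from
commonly published LFSR tap tables, since the text does not state the
polynomial; the 6-bit skew between multiplexer inputs is modelled as a
rotation, which is also our assumption.

```python
WIDTH = 39
TAPS = (39, 35)            # assumed feedback taps (1-based), not given in the text
MASK = (1 << WIDTH) - 1

def lfsr_step(state: int) -> int:
    """One step of a Fibonacci LFSR: shift left, insert the XOR of the taps."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & MASK

def rotate_left(word: int, shift: int) -> int:
    shift %= WIDTH
    return ((word << shift) | (word >> (WIDTH - shift))) & MASK

def mux_input_patterns(state: int, inputs: int = 5, skew: int = 6):
    """Feed each crossbar input with the LFSR word skewed by 6 bits per input."""
    return [rotate_left(state, skew * i) for i in range(inputs)]

def misr_step(signature: int, response: int) -> int:
    """MISR: an LFSR step with the response word XORed into the state."""
    return lfsr_step(signature) ^ (response & MASK)

def run_misr(responses) -> int:
    signature = 0
    for response in responses:
        signature = misr_step(signature, response)
    return signature

# Fault-free block modelled as a plain wire; a fault flips one response bit.
state, responses = 1, []
for _ in range(64):
    responses.append(state)
    state = lfsr_step(state)
golden = run_misr(responses)
corrupted = list(responses)
corrupted[20] ^= 1 << 7
```

Comparing `run_misr(corrupted)` with `golden` is the diagnosis step: a single
corrupted response word always changes the final signature, while multiple
corrupted words can alias with a small probability.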
3 Experimental Results
This section describes the experimental results of a 5x5 switch (second vari-
ant) synthesized at the target speed of 500MHz with the 40nm low-power
SVT Inﬁneon technology library. Input buﬀers are assumed to be fully syn-
chronous.
Figure 6.4 shows the area overhead of the above switch (rightmost bar) with
respect to an intermediate implementation without any testing support and
to a baseline TMR extension of the xpipesLite switch (see D2.1). It can be
observed that the area overhead for testing amounts to only 12.96%. At the
same time, more functionalities and provisions than the TMR switch are
delivered at a much lower area footprint.
Figure 6.5: Testing overhead and area Breakdown.
Figure 6.5 illustrates the area breakdown of the testing logic in the switch.
The major contributor to the testing logic is the set of MISRs used to
perform the diagnosis, which accounts for ∼8.50%. The remaining part of the
overhead is spread among the wrappers and the LFSRs used as test pattern
generators.
Figure 6.6: Normalized routing delay @Max Performance.
Switch sub-block    Cells    Coverage
Arbiter               393    96.43%
Multiplexer           144    99%
Input Buffer          803    99.5%
Output Buffer         640    98.7%
Encoder               136    99.6%
Shift                  56    100%
TOT                  2172    98.7%
Table 6.1: Coverage for single stuck-at faults.
The number of cycles required to reach 98.7% coverage of single stuck-at
faults in the whole switch is ∼10,000. We found this a satisfactory trade-off
given the use of pseudo-random test patterns.
Table 6.1 reports the number of cells and the coverage of each switch
sub-block. We used the same fault simulation framework as for the first
variant switch. The overall data path achieves a coverage of ∼99.19%, while
the control path reaches ∼96.43%; the lower figure is due to the complexity
of the FSMs inside the arbiter.
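The aggregate figures can be cross-checked against Table 6.1. The sketch
below assumes that the totals are cell-weighted averages of the per-block
coverage, that the arbiter is the control path, and that all other blocks
form the data path; under these assumptions the table values reproduce both
the 98.7% and the ∼99.19% figures to within rounding.

```python
# (cells, coverage in %) per switch sub-block, copied from Table 6.1
TABLE = {
    "Arbiter": (393, 96.43),
    "Multiplexer": (144, 99.0),
    "Input Buffer": (803, 99.5),
    "Output Buffer": (640, 98.7),
    "Encoder": (136, 99.6),
    "Shift": (56, 100.0),
}

def weighted_coverage(blocks):
    """Cell-weighted average coverage over the given sub-blocks."""
    cells = sum(TABLE[b][0] for b in blocks)
    covered = sum(TABLE[b][0] * TABLE[b][1] for b in blocks)
    return cells, covered / cells

total_cells, total_cov = weighted_coverage(TABLE)
dp_cells, dp_cov = weighted_coverage([b for b in TABLE if b != "Arbiter"])
print(f"whole switch: {total_cells} cells, {total_cov:.2f}%")
print(f"data path:    {dp_cells} cells, {dp_cov:.2f}%")
```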
Finally, Figure 6.6 shows the impact of the switch features on its critical
path. The testing logic affects the maximum operating speed of the switch:
routing delay becomes ∼8% larger than that of the switch without the testing
logic. Nevertheless, the maximum achievable operating speed is still higher
than that of the TMR reference solution.
Last but not least, replacing the input buffer with a dual-clock FIFO incurs
practically no area overhead, provided the number of buffer slots is kept
the same. Actual buffer sizing then depends on network-level requirements
such as the speed ratio between sender and receiver as well as the needed
throughput across a multi-synchronous link [52].
Chapter 7
Ultra-Low Latency NoC testing
via Pseudo-Random Test
Pattern Compaction
1 Abstract
This chapter aims at devising an optimized pseudo-random test methodology
for NoCs and its architectural support. The guiding principle consists of using
a test pattern compaction engine for generating minimal test lengths. We
show the application of this principle driven by the objective to minimize
test application time, at the cost of test wrapper complexity. The achieved
design point results in a reduction of test application time by two orders
of magnitude with respect to state-of-the-art test architectures for NoCs
exploiting pseudo-random patterns.
2 Introduction
All systems-on-chip (SoCs) should be tested for manufacturing defects. Such a
testing procedure turns out to be particularly challenging for those large
scale SoCs making use of a network-on-chip (NoC) as their communication
backbone: the controllability/observability of NoC links and sub-blocks is rel-
atively reduced since they are deeply embedded and spread across the chip.
This issue adds up to the newer challenges of testing generic large digital
designs in nanoscale technologies. For instance, pin-count limitations restrict
the use of I/O pins dedicated for testing. Other concerns regard the use of
external testers, which has been a mainstream testing practice so far: lack
of scalability of test data volumes and high cost for full clock speed testing.
Finally, wear-out mechanisms such as oxide breakdown, electro-migration
and mechanical/thermal stress become more prominent in aggressively scaled
technology nodes. These breakdown mechanisms occur over time, therefore
the methodology and the infrastructure used for production testing should
be designed for re-use during the system lifetime as well. This again calls
for new testing strategies. Built-in Self-Testing (BIST) to some extent
overcomes the above problems since test patterns are generated and evaluated
on chip. By exploiting precise knowledge of the architecture and of the
circuits under test, a designer can come up with deterministic test patterns,
thus potentially obtaining minimized test sequences and superior coverage
results. Unfortunately, engineering such handcrafted deterministic patterns
is a largely manual and time-consuming task, which must be repeated in case
of technology library migrations or circuit modifications. Moreover,
placement and routing of the design will certainly modify the gate-level
netlist, thus making coverage expectations not fully trustworthy.
For synthesized logic, which makes up a large part of an embedded system,
pseudo-random test patterns are frequently used because of their higher
flexibility. They potentially result in a lightweight test architecture due
to the simplicity of the linear feedback shift registers (LFSRs) and of the
multiple input signature registers (MISRs) used to generate test patterns and
signatures respectively. The work in [50] proves a 20% area saving in 45nm
technology when a BIST architecture for NoCs is fed by pseudo-random patterns
rather than by handcrafted deterministic ones. Unfortunately, testing with
pseudo-random patterns also typically dominates the total runtime of the
BIST: in [12], 200,000 cycles are reported for testing the main modules of
a NoC switch with such patterns. A reduction of one order of magnitude in
testing latency has been proved in [50]; however, the authors there take a
hybrid approach, where pseudo-random test patterns are combined with
deterministic ones and with architecture-specific DfT optimizations.
In this chapter, we view the ﬂexibility and the reduced test latency require-
ments as fundamental for eﬃcient NoC testing. On one hand, re-engineering
deterministic patterns for each product evolution or technology migration is
not cost-eﬀective. On the other hand, overly long test application times are
not compatible with the lifetime testing paradigm, where a testing procedure
may be run at least at each system bootstrap.
As a result, the main objective of this chapter is to develop an
ultra-low-latency testing framework for NoCs capable of achieving
unprecedented test application times (below 250 cycles regardless of the
network size) without reverting to deterministic patterns. Instead, we start
from pseudo-random patterns and develop a testing methodology and its
architectural support in the NoC that preserve the generic and flexible
nature of the patterns. Such methodology and test architecture are therefore
general in scope and can be applied to NoC architectures other than the one
considered in this chapter. Test set compaction is at the core of our testing
framework. In order to minimize the test application time, we chose to
compact test patterns for combinational logic blocks, where compaction
efficiency is more likely to outperform that achievable for sequential
circuits. Registers are tested with standard techniques. Combined with the
concurrent testing of switch sub-blocks, this approach achieves an order of
magnitude lower test application time than current literature.
The trade-off is clearly with the implementation complexity of the test
wrapper, which needs to isolate the circuits that are concurrently tested and
to connect them to their test pattern generators (TPGs) and response
analyzers. This chapter also proposes an optimization strategy to limit such
overhead, which consists of cascading several circuits under test (test
responses of one block become test patterns for the next one) and of
exploiting the synergies between them.
3 Related Work
As integration densities keep increasing, on-chip interconnection networks
are becoming the reference communication backbone for multi-core computing
in many embedded high-performance systems [10], [7]. However, defects
continue to increase and are more prominent in scaled technologies. To
cope with these high defect rates, many test mechanisms have been proposed.
For example, [15] and [19] propose a test mechanism for regular and modular
systems like on-chip networks, but they incur a considerable area overhead.
In the same way, [18], [16], [20] and [17] have proposed full-scan and
boundary-scan strategies, but they still have a high area overhead, as shown
in [14]. An alternative approach to reduce this overhead could be the partial
scan technique; however, its testing time is the main drawback (tens of
thousands of clock cycles).
Another solution consists of applying test patterns from the input/output
borders of the network [8]; this approach has been extended in [9] in such a
way to support the diagnosis too. However, this approach is limited by the
high number of necessary test pins.
The work in [3] proposes a testing framework that is based on handcrafted
deterministic test patterns and exploits the inherent structural redundancy
of NoC switches. This deterministic approach leads to one of the fastest
testing times reported in the literature for single stuck-at faults. However,
it requires in-depth knowledge of the architecture under test and an
extensive effort to carefully engineer handcrafted test patterns for it. On
the same NoC architecture as [3], the work in [50] implements a test
architecture fed by pseudo-random patterns. The achieved design point cuts
down on the area overhead by 20% in 45nm technology but provides comparable
coverage at the cost of one order of magnitude longer test application time.
On a different switch architecture, [12] reports several tens of thousands
of cycles for pseudo-random testing of most parts of the switch, confirming
that the use of such patterns inherently plays against testing latency.
In this chapter, on the same NoC architecture as [3] and [50], we provide
a new design point, which achieves a high fault coverage with generic test
methodology steps and architecture design techniques (i.e., no deterministic
patterns). We rely on existing test set compaction tools (namely [4] and [5])
to cut down on testing latency, and elaborate on the test methodology and
on the architecture support needed to achieve unprecedented values of test
application time. The key challenge this chapter deals with is to identify
the most suitable circuits for test set compaction, so as to maximize
compaction efficiency and minimize test application time. At the same time,
the implementation complexity of test wrappers, TPGs and response analyzers
is kept under control.
4 Testing methodology
The philosophy behind this work is that pseudo-random test patterns are de-
sirable for NoC testing since they can be easily reused and extended across
architecture variants and technology migrations. In fact, they save the con-
siderable eﬀort and time to develop handcrafted deterministic patterns for
the architecture under test.
On the other hand, we believe that some optimizations with respect to their
naive application to NoCs are necessary to reduce test application time. This
chapter pushes this consideration to the limit and tackles the challenge of
materializing an ultra-low latency testing framework for NoCs start-
ing from pseudo-random test patterns. The applied optimizations then
retain the generic and ﬂexible nature of the testing methodology by never
reverting to deterministic test patterns.
We found test set compaction a suitable architecture-agnostic step to opti-
mize test patterns, where the speciﬁc architecture implementation comes into
play only in determining the achieved compaction eﬃciency. In this chapter
we applied state-of-the-art compaction tools from the Turbo Tester [23] suite
to the test patterns for the modules of NoC switches. Note that compaction
tools have more degrees of freedom for test set optimization when they are
fed with long test sequences that most likely stimulate the same fault
multiple times. Pseudo-random test sequences are therefore an ideal target,
since they are commonly longer than the deterministic sequences generated
by ATPG tools.
In order to obtain minimal test lengths, the following strategy was selected.
First, long pseudo-random test sequences were generated until the obtained
coverage for single stuck-at faults was 99%. Then, an efficient static test
set compaction tool (Optimize [4]) was run by requesting it to derive a
compressed test set matching the previously obtained coverage for each tested
module. Finally, a verification step of the achieved coverage with the new
test set was performed. The compacted vectors are then easily hardwired in
a hardware TPG.
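To make the idea of static compaction concrete, the following sketch applies
a plain greedy covering heuristic to a fault-detection table. This is not
the algorithm implemented by Optimize [4], whose internals are not described
here; the table, the fault names and the tie-breaking rule are invented for
illustration.

```python
from typing import Dict, List, Set

def compact(detects: Dict[int, Set[str]]) -> List[int]:
    """Greedy static compaction: keep patterns until every detected fault is covered."""
    target = set().union(*detects.values())
    remaining = set(target)
    kept = []
    while remaining:
        # Pick the pattern covering most uncovered faults (lowest index on ties).
        best = max(detects, key=lambda p: (len(detects[p] & remaining), -p))
        kept.append(best)
        remaining -= detects[best]
    return kept

# Hypothetical fault simulation result: pattern index -> faults it detects.
detects = {
    0: {"f1", "f2"},
    1: {"f2"},
    2: {"f3", "f4", "f5"},
    3: {"f1", "f5"},
    4: {"f4"},
    5: {"f6", "f1"},
}
kept = compact(detects)

# Verification step: the compacted set must preserve the original coverage.
original = set().union(*detects.values())
preserved = set().union(*(detects[p] for p in kept))
```

Long pseudo-random sequences detect the same fault many times over, which is
why half of the patterns can be dropped in this toy example without losing
coverage.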
A key issue for our testing framework was to properly identify the circuit
blocks the methodology should be applied to. The guiding principle in making
this choice was to achieve the lowest possible testing latency for the NoC as
a whole. This was pursued in two ways:
• Testing of NoC switches was engineered in such a way that all switches
can be tested in parallel. The key enabler was to implement a coopera-
tion mechanism between switches for testing their inter-switch links. In
practice, each switch sends test patterns across its outgoing links and
response analysis will be performed in the neighboring switches. The
opposite holds for incoming links, which are analyzed locally. At the
same time, switch internal blocks are tested in parallel, thus avoiding
dependencies between testing phases and enabling the maximum
testing parallelism.
• Test set compaction for combinational logic is intuitively simpler and
potentially more eﬀective than that for sequential circuits. There are
a number of reasons for this. First, the compaction algorithm should
deal with simple test vectors rather than with test sequences. Secondly,
testing sequential logic of switch FSMs presents diﬀerent requirements
with respect to combinational logic testing. In fact sequential logic has
fewer inputs than combinational logic but needs more clock cycles to
be stimulated, thus requiring a suitable wrapper properly sequencing
the outputs of the test generator. For these reasons, our goal of minimizing
test application time motivated the choice of splitting each
switch sub-block (essentially arbiters, buffers and crossbar) into its
combinational logic and registers for the sake of testing. Then, registers
were tested with well-known techniques (fundamentally, comparison of
test responses or built-in scan-chains for self-testing), while the test set
compaction methodology was applied to combinational logic.
The choice of isolating combinational logic for efficient testing has a
relevant impact on FSMs. In fact, their state registers are tested separately,
and the current state signals feeding the combinational logic must be made
controllable by the TPG. In turn, the outputs and the next state signals from
the combinational logic must be made observable by the response analyzer.
For testing the registers, we adopted the most convenient standard approach
depending on the register type (state registers, configuration registers and
data registers).
The above design decisions paved the way for a test architecture for NoCs
aiming at competitive coverage of single stuck-at faults and unprecedented
test application times with respect to current literature, at the cost of
test wrapper complexity. However, the choice of pseudo-random test patterns
and of their compaction lays the foundation for low-overhead TPGs and
response analyzers, thus partially counterbalancing the footprint of the test
wrapper. The achieved trade-off and its comparison with state-of-the-art
solutions are quantified in the experimental results section.
5 Baseline Switch Architecture
A 5x5 xpipesLite switch [6] has been used in this chapter as the baseline
design point without any testing support. The main blocks of this switch
are illustrated in Figure 7.1. For each input port, a 2-slot input buﬀer and a
routing module are instantiated. We use logic-based distributed routing
(LBDR) [22], which computes target output ports by means of simple
combinational logic for each packet head. In order to support different
routing algorithms, the logic is fed by 26 configuration bits per input port.
For each output port, a port arbiter and a 6-slot output buffer are
instantiated. Also, a 5x1 multiplexer is placed in front of each output
buffer, globally building up the switch crossbar. The switch implements
wormhole switching and stall/go flow control.
6 Testing Architecture
Next we present in detail the changes applied to each block of the
architecture. We analyze in turn the testing scheme of the LBDR, arbiter,
output buffer, input buffer and crossbar. After analyzing each block
standalone, we perform an optimization by merging some test phases
Figure 7.1: Baseline Switch Architecture.
together.
6.1 LBDR testing
Each LBDR routing module is split into two main blocks so that the key
concept of our approach can be applied (see Fig. 7.2). The first block is
composed of the LBDR configuration registers (FF i blocks of Fig. 7.2), while
the second one is the LBDR combinational logic (Combinational block of
Fig. 7.2). The combinational logic processes the information contained in the
configuration registers, which describe the routing restrictions of the
routing algorithm (Rbits), the connectivity of the switch ports (Cbits), the
local switch ID in the topology (Sid) and the deroutes to be taken in some
special cases (Dbits) [22]. In test mode, this information is randomized by
the TPG, which implements the compacted vectors. The LBDR logic also reads
the destination address (11 bits) from the head flit, which again needs to be
randomized in test mode. The compacted vectors encoded in the TPG Optimized
block of Fig. 7.2 are used to test the combinational block of the LBDR.
Before reaching the block under test, the compacted test set first has to
cross the wrapper placed in front of it. Response analysis is performed by
comparators exploiting the output of the 5 switch
Figure 7.2: Lbdr Testing Architecture.
ports, but nothing prevents the use of MISRs. In both cases, the faulty
routing module can be easily identified. All switch routing modules share the
same TPG.
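Comparator-based response analysis across identical instances can be
sketched as below. The model assumes that at most a minority of the five
instances is faulty, so that the most common output can serve as reference;
the routing function is a stand-in and does not reproduce the LBDR logic.

```python
from collections import Counter
from typing import List

def compare_responses(outputs: List[int]) -> List[int]:
    """Indices of the instances that disagree with the most common output."""
    reference, _ = Counter(outputs).most_common(1)[0]
    return [i for i, out in enumerate(outputs) if out != reference]

def routing_stub(pattern: int) -> int:
    """Hypothetical stand-in for the combinational routing logic (5-bit output)."""
    return (pattern * 7 + 3) & 0x1F

def faulty_routing_stub(pattern: int) -> int:
    """Same logic with output bit 2 stuck at 0."""
    return routing_stub(pattern) & ~0b00100

flagged = set()
for pattern in range(16):                 # same TPG pattern to all instances
    outputs = [routing_stub(pattern)] * 5
    outputs[3] = faulty_routing_stub(pattern)
    flagged.update(compare_responses(outputs))
```

Since all instances receive the same patterns from the shared TPG, no golden
response needs to be stored: the healthy instances act as the reference.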
The scan-chain approach has been adopted for testing the registers of the
LBDR. A 1-bit test pattern generator is used to inject a sequence of 0s/1s
along the scan chain. A response analyzer, placed at the end of the scan
chain, receives and analyzes the response bits after a number of cycles that
depends on the number of registers to cross. The configuration registers
within the same LBDR module are tested sequentially, while the five LBDR
modules are tested in parallel; they share the same TPG and counter but have
different analyzers.
Finally, a BIST control engine drives the select bits of the test wrapper.
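A behavioural sketch of the scan-chain test follows. It assumes an
alternating 0/1 input stream and a 26-flop chain (one per configuration bit
of an input port); the fault model is a single flip-flop stuck at 0 or 1,
and the analyzer simply compares each output bit with the input bit injected
one chain length earlier.

```python
from typing import Optional, Tuple

def scan_chain_passes(length: int, stuck: Optional[Tuple[int, int]] = None) -> bool:
    """Shift an alternating bit stream through the chain and check the output."""
    chain = [0] * length
    stream = [t % 2 for t in range(length + 4)]     # 0, 1, 0, 1, ...
    passed = True
    for t, bit in enumerate(stream):
        observed = chain[-1]                        # bit leaving the chain
        if t >= length and observed != stream[t - length]:
            passed = False                          # analyzer mismatch
        chain = [bit] + chain[:-1]                  # shift by one position
        if stuck is not None:
            index, value = stuck
            chain[index] = value                    # stuck-at fault on one flop
    return passed

fault_free = scan_chain_passes(26)
all_detected = all(
    not scan_chain_passes(26, (i, v)) for i in range(26) for v in (0, 1)
)
```

The analyzer only needs a counter to know when the first injected bit
reaches the end of the chain, which matches the counter shared among the
LBDR modules.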
6.2 Arbiter testing
We distinguish between combinational logic and state registers also in the
arbiter FSM. The testing diagram is depicted in Fig.7.3.
The arbiter state registers are tested in the same way as those of the
input/output buffers (see Section 6.3).
Figure 7.3: Arbiter Testing Architecture.
For the combinational block, besides connecting its primary inputs to the
optimized TPG, we broke the feedback loop of the FSM in order to increase
the controllability of the block. A comparator is again used for response
analysis, thus exploiting the multiple instantiation of arbiters in the switch
and their concurrent testing.
6.3 Output buﬀer testing
The output buﬀer belongs to the data path of the switch but it internally
consists of an actual data path (data registers with selection input demux
and output mux) and of a control FSM driving the read and write pointers
of the mux and demux. For the sake of testing, these two internal blocks have
been separated. The test architecture is illustrated in Fig.7.4.
The combinational block of the control-path is tested with compacted vec-
tors generated by the Turbo Tester tool. State registers of the FSM could
Figure 7.4: Output Buﬀer Testing Architecture.
be tested as previously illustrated for the LBDR. However, here we present
a further opportunity, which consists of feeding the combinational logic
outputs to the registers and comparing their outputs cycle by cycle with the
same outputs from the other switch buffers. Even in this case, full coverage
is guaranteed. In order to test the data path registers, we used a 32-bit
register whose outputs are inverted and connected back to its inputs, thus
generating an alternating sequence of 0s and 1s. Since the output signals
from the control path are not random enough, we opted for a small LFSR to
drive the read/write pointers of the data path. Both pointers are connected
to the same LFSR outputs, so as to coherently swap all register banks.
6.4 Input buﬀer testing
The input buffer is tested in exactly the same way as the output buffer, and
its testing scheme is identical to the one shown for the output buffer
in Fig. 7.4. The only difference is the number of slots (2 for
the input buffer and 6 for the output buffer).
Figure 7.5: Testing Architecture for Crossbar Multiplexers.
6.5 Testing Multiplexers of the Crossbar
The testing infrastructure adopted to test the multiplexers of the crossbar is
depicted in Fig. 7.5. The TPG is a 5-bit maximal-length LFSR, which generates
all the patterns necessary to stimulate all the possible states of the
multiplexers in less than 32 cycles. In particular, the relevant patterns
for the testing of a 5x1 multiplexer are the following: 10000, 01111, 01000,
10111, 00100, 11011, 00010, 11101, 00001 and 11110. Each pattern generated
by the LFSR feeds a block called "select-mux". The latter selects the
multiplexer input port carrying the logic value that is negated with respect
to the other 4 input values. Finally, the diagnosis is performed by means of
a comparator exploiting the outputs of the 5 multiplexers of the crossbar.
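The pattern generation described above can be sketched in Python as follows. This is a minimal behavioural model, not the RTL of the thesis: the feedback taps (chosen so that the sequence follows the primitive polynomial x^5 + x^2 + 1), the seed, and the behaviour of `select_mux` on patterns outside the ten relevant ones (it returns `None`) are all assumptions made for illustration.

```python
def lfsr5(seed=0b00001):
    """Yield the 31 non-zero states of a 5-bit maximal-length LFSR.

    Fibonacci-style shift register; the taps below are an assumed choice
    that gives a period of 31.
    """
    state = seed & 0x1F
    for _ in range(31):
        yield state
        feedback = ((state >> 4) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0x1F


def select_mux(pattern):
    """Return the index (0 = LSB) of the input whose value is negated with
    respect to the other four, or None if the pattern has no such input."""
    ones = bin(pattern).count("1")
    if ones == 1:        # one-hot pattern, e.g. 10000
        odd = pattern
    elif ones == 4:      # complemented one-hot pattern, e.g. 01111
        odd = ~pattern & 0x1F
    else:
        return None
    return odd.bit_length() - 1


states = list(lfsr5())
relevant = [s for s in states if select_mux(s) is not None]
print(len(states), sorted(f"{s:05b}" for s in relevant))
```

Because a maximal-length 5-bit LFSR visits every non-zero state once per period, the ten relevant patterns are all applied within 31 cycles, regardless of the seed.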
6.6 Testing Infrastructure Optimization
After testing each block independently, we cascaded the circuits under test,
exploiting the synergies between them in order to cut down on test wrapper
and TPG complexity.
Figure 7.6: Cascaded Testing Architecture.
The cascade is composed of the following modules:
• The Crossbar of the upstream switch
• The Output buﬀer of the upstream switch
• The Inter-switch link
• The Input buﬀer of the downstream switch
At the beginning of the cascade, we inserted a 5-bit LFSR (as described in
Figure 7.5). Since the test responses of one block become the test patterns
for the next one, this LFSR is responsible for the injection of test vectors
for the whole cascade.
Such optimization allowed us to remove the register-based input/output TPGs
previously located in front of the input/output buffers. In the same way, we
removed the comparators located after the multiplexer and after the output
buffer. The comparator located after the input buffer in the downstream
switch is the only one preserved. The cascade testing scheme is presented in
Fig. 7.6.
To test the communication link, we synchronized the injection rate of some of
the test pattern generators. In this specific case, we have three independent
TPGs to synchronize:
• The TPG in front of the multiplexer input ports
• The two LFSRs of the read/write pointers of the input/output buffers
The synchronization must be performed in such a way as to ensure the maximum
coverage of all the blocks that belong to the cascade.
The multiplexer TPG starts to inject the test data as soon as the test mode
is selected. The injection length depends on the number of cycles necessary
to maximize the coverage of the cascade. Meanwhile, the LFSRs located in the
output buffer of the upstream switch and in the input buffer of the
downstream switch are clocked once every 31 cycles. Clearly, they must be
clocked at least six times (i.e. the number of slots of the output buffer) in
order to allow the test data to cross all the data-path registers.
7 Experimental results
This section presents the experimental results for a 5x5 NoC switch
synthesized at 600 MHz in a 65nm industrial technology library. For the sake
of comparison, two test architecture variants for the same baseline switch
are available from previous work and are used in this chapter to assess the
trade-offs of the new design point proposed here. Our approach with optimized
pseudo-random patterns is therefore contrasted with the handcrafted
deterministic test patterns used in [3] and with the non-compacted
pseudo-random patterns used in [50], and above all with their enabling test
architectures. In this comparison we do not consider ATPG-generated test
patterns, since it has been demonstrated in [3] that the latter are not
competitive with the handcrafted deterministic ones from a coverage
viewpoint. This chapter does not consider solutions providing less than 98%
coverage on the considered NoC architecture.
7.1 Area Overhead
Fig. 7.7 illustrates the area overhead of three BIST solutions for the
xpipesLite switch (the one of this chapter, the one with handcrafted
deterministic patterns [3] and the one with non-compacted pseudo-random
patterns [50]), normalized with respect to the deterministic switch. As we
can see, around 11% of the area overhead of the deterministic approach comes
from the wrapper, needed because of the different test phases that this
approach requires, in addition to switch sub-block isolation for the sake of
testing. Another 7% comes from TPGs, which encode the handcrafted test
patterns, and marginally from diagnosis logic and the BIST manager. Finally,
the comparator tree used for response analysis takes around 13% of the area.
Figure 7.7: Area Overhead: Deterministic [3], Pseudo-Random [50] and
Proposed Compacted Pseudo-Random Approach. (Stacked bars for PSEUDO_RANDOM,
DETERMINISTIC and PROPOSED; vertical axis: normalized area overhead, 0 to
1.2; components: TPG, MISR, COMPARATOR, WRAPPER, SWITCH.)
When the test architecture is reconceived for non-compacted pseudo-random
patterns, the area overhead is reduced by ∼26% in the considered 65nm
library. Interestingly, the breakdown is completely different. MISRs are used
for response analysis and account for most of the area overhead. LFSRs are
extremely compact TPGs, while less than 3% of the area is devoted to the
test wrapper. In this architecture, block cascading was extensively used
(i.e., test responses of some blocks are fed as test patterns to downstream
blocks, as long as cascading does not hurt coverage too much), thus cutting
down on the test wrapper overhead.
When it comes to the test architecture with test set compaction, the area
overhead is 9.8% with respect to the deterministic approach. The proposed
solution optimizes pseudo-random patterns, thus incurring a small area
overhead while significantly improving test application time (see section
7.2). Most of the area overhead is due to the multiple instantiation of
comparators and to the test wrapper, which needs to provide finer-grain
circuit isolation in test mode. The optimizations applied to the test
infrastructure proved very effective in reducing the area overhead, to the
extent that it closely tracks the overhead of the deterministic test
architecture.
7.2 Testing time
Table 7.1: Test Application Time Per Block
TPGs               Multiplexer   Output LFSR   Input LFSR
Injection period   1 cycle       31 cycles     31 cycles
#Vectors           239           8 (7.7)       8 (7.7)
Total cycles       239           239           239
Table 7.1 contains the injection rates of the TPGs used to optimize the
cascade of Fig. 7.6. The TPG located at the beginning of the cascade starts
to inject the test data as soon as the test mode signal driven by the BIST
engine is high. As reported in Table 7.1, this injection lasts for ∼239
cycles. At the same time, the multiplexer TPG controls the select inputs of
the crossbar through the "select-mux" block. Meanwhile, the local LFSRs of
the output/input buffers in the upstream/downstream switch control the
read/write pointers of the data-path registers. These local LFSRs inject a
new pattern every 31 cycles, as reported in Table 7.1. Overall, the test
application time amounts to 239 cycles.
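The relation between the injection periods and the vector counts of Table 7.1 can be checked with a few lines of Python. The reading of the "8 (7.7)" entry as the ratio 239/31 rounded up to a whole number of pointer values is our interpretation, and the variable names are illustrative.

```python
import math

TOTAL_CYCLES = 239        # test application time (Table 7.1)
POINTER_PERIOD = 31       # pointer LFSRs are clocked once every 31 cycles
OUTPUT_BUFFER_SLOTS = 6   # minimum number of pointer updates required

# The multiplexer TPG injects one vector per cycle.
mux_vectors = TOTAL_CYCLES

# Pointer LFSRs: exact ratio and whole number of pointer values applied.
pointer_ratio = TOTAL_CYCLES / POINTER_PERIOD
pointer_vectors = math.ceil(pointer_ratio)
pointer_updates = TOTAL_CYCLES // POINTER_PERIOD

# The test data can cross all data-path registers only if the pointers
# are updated at least once per output-buffer slot.
covers_all_slots = pointer_updates >= OUTPUT_BUFFER_SLOTS

print(mux_vectors, round(pointer_ratio, 1), pointer_vectors, covers_all_slots)
```

Under this reading, 239 cycles allow 7 complete pointer updates, which satisfies the requirement of at least six stated for the output buffer.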
Table 7.2: Testing Cycles as function of the Testing Approach
Testing Technique                             Testing Time (cycles)
Compacted Pseudo-Random Testing Approach      239
Deterministic Testing Approach                1104
Pseudo-Random Testing Approach                10000
Table 7.2 compares this value with those of the alternative approaches, for
the same target coverage of approximately 99%. All approaches implement some
form of testing cooperation between neighboring switches and share the same
baseline switch architecture. However, test set compaction, a proper choice
of the granularity of the circuits to test (and the consequent compaction
efficiency) and the merging of test phases make the approach of this chapter
the fastest. Only handcrafted deterministic patterns come reasonably close to
the testing time of this chapter, with around 1104 cycles. Although
non-compacted pseudo-random patterns can achieve 96% coverage in a time
comparable to that of handcrafted deterministic patterns, they take around
10000 cycles to reach around 98% coverage.
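For reference, the ratios implied by the cycle counts of Table 7.2 can be computed directly. The dictionary keys below are descriptive labels chosen for this sketch; only the cycle counts come from the table.

```python
# Test application time in cycles, as listed in Table 7.2.
test_cycles = {
    "compacted pseudo-random": 239,
    "handcrafted deterministic": 1104,
    "non-compacted pseudo-random": 10000,
}

reference = test_cycles["compacted pseudo-random"]

# How many times longer each alternative takes than the compacted approach.
slowdown = {name: cycles / reference for name, cycles in test_cycles.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.1f}x")
```

The deterministic approach takes about 4.6 times as many cycles as the compacted one, and the non-compacted pseudo-random approach about 41.8 times as many.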
Table 7.3: Test application time and coverage of different testing methods
Approach   Test cycles                   Coverage
Ours       239                           98.3%
[3]        864 - 1104                    99.3%
[2]        3.88 x 10^2 - 2.89 x 10^3     97.79%
[20]       4.05 x 10^5                   95.20%
[13]       2.74 x 10^3                   99.89%
[14]       9.45 x 10^3 - 3.33 x 10^4     98.93%
[21]       5 x 10^4 - 1.24 x 10^8        N.A.
[11]       320                           99.33%
[12]       200 x 10^3                    full (no exact numbers)
Our test application time compares favorably with previous work, as Table
7.3 shows. Only [2] and [11] are somewhat competitive. However, [2] does not
test the control path, while [11] reports 320 cycles for a 3x3 mesh (made of
a simplified switch architecture), a figure which grows linearly with network
size. Also, the latter approach makes additional use of BIST logic for the
control path that is not accounted for in the statistics.
We feel that area overhead is hardly comparable with previous work since
whenever numbers are available, features of the testing frameworks are very
diﬀerent (e.g., control path not tested [2], test patterns generated externally
[20, 14], diagnosis missing [20, 13, 14, 21], lack of similar test time scalability
[8, 11], NoC architecture with overly costly links [13]). Moreover, the impact
of synthesis constraints is never discussed.
Table 7.4: Coverage as function of the Testing Approach
Testing Technique                             Coverage (%)
Compacted Pseudo-Random Testing Approach      98.3
Deterministic Testing Approach                99.30
Pseudo-Random Testing Approach                98.24
7.3 Coverage
The coverage obtained for single stuck-at faults is reported in Table 7.4.
The technique proposed in this chapter tracks the coverage of the
pseudo-random approach, although it does far better in terms of latency.
At the same time, our technique removes the burden of deriving handcrafted
test patterns and enables the use of test generation and compaction tools,
with one order of magnitude lower testing latency.
Table 7.5: Compaction Table
Combinational   Random       Random       Compacted    Compacted
Logic           (#vectors)   (coverage)   (#vectors)   (coverage)
FSM Input       1000         100.00%      10           100.00%
FSM Output      1000         100.00%      17           100.00%
LBDR            400000       93.04%       61           92.17%
ARBITER         170000       99.37%       42           99.37%
Table 7.5 shows the impact of the compaction tool on the pseudo-random
test set generated for each combinational block of the switch control logic.
The second and third columns of the table report the number of pseudo-random
vectors together with their coverage, while the last two columns show the
number of vectors and the respective coverage once they are compacted by
the tool. The compaction operation is efficient, and the compacted test set
tracks the previously obtained coverage for each tested module.
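To give an intuition of what test set compaction does, the following Python sketch implements a generic greedy compaction over a fault dictionary. It is not the compaction tool used for Table 7.5, whose algorithm is not described here; the vector names, the fault identifiers and the greedy heuristic itself are assumptions made for illustration.

```python
def compact(vectors, detects):
    """Greedy static compaction of a test set.

    vectors: list of vector identifiers (e.g. in generation order).
    detects: dict mapping each vector to the set of faults it detects.
    Returns a sub-list of vectors that detects the same faults as the
    full set, picking at each step the vector that covers the most
    faults still undetected.
    """
    remaining = set().union(*(detects[v] for v in vectors))
    kept = []
    while remaining:
        best = max(vectors, key=lambda v: len(detects[v] & remaining))
        kept.append(best)
        remaining -= detects[best]
    return kept


# Hypothetical fault dictionary: 5 pseudo-random vectors, 5 faults.
fault_dictionary = {
    "v0": {1, 2},
    "v1": {2},
    "v2": {3, 4, 5},
    "v3": {1},
    "v4": {5},
}
all_vectors = list(fault_dictionary)
compacted = compact(all_vectors, fault_dictionary)
print(compacted)
```

In this toy example two of the five vectors are enough to detect every fault that the full set detects, which mirrors (on a much smaller scale) the reductions reported in the table.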
8 Conclusions
This chapter presents a testing methodology and architecture support for
NoCs that aim at the minimization of test application time, a requirement
that well matches the future requirements of lifetime testing frameworks.
In fact, while testing latency was not a concern for production testing, it
becomes such when the testing procedure is run at system bootstrap and/or
at runtime. We demonstrate NoC testing in less than 250 cycles. Above all, we
do not achieve this result with handcrafted deterministic test patterns, but
rather with an optimization methodology of pseudo-random patterns. The
guiding principle is test set compaction, although the low-latency requirement
forces a careful selection of the logic to test for best compaction eﬃciency.
The trade-off is therefore between test application time and test wrapper
overhead, although the final area footprint tracks that of a test architecture
with handcrafted deterministic patterns while offering one order of magnitude
lower testing latency.

Chapter 8
Cost-eﬀective Contention
Avoidance in a CMP with
Shared Memory Controllers
1 Abstract
Eﬃcient CMP utilisation requires virtualisation. This forces multiple appli-
cations to contend for the same network resources and memory bandwidth.
In this chapter we study the cause and eﬀect of network congestion with re-
spect to traﬃc local to the applications, and traﬃc caused by memory access.
This reveals that applications close to the memory controller suﬀer because
of congestion caused by memory controller traﬃc from other applications. We
present a simple mechanism to reduce head-of-line blocking in the switches,
which eﬃciently reduces network congestion, increases network performance,
and evens out the performance diﬀerences between the CMP applications.
2 Introduction
The access to the oﬀ-chip memory in large chip multiprocessors (CMPs)
based on a switched interconnect (NoC) consumes a signiﬁcant portion of the
bandwidth in the on-chip network. Furthermore, this traﬃc is targeted to-
wards speciﬁc areas of the chip where the memory controllers are connected.
Previous studies [57] show that the placement of these memory controller
connections has a significant impact on the network load, but the authors
do not study how this impacts the performance of the applications themselves
and the complex interaction between the local traffic (caused by cache
coherency protocols) and the memory controller traffic. The initial intuitive
understanding of the eﬀect of memory controller access point placement on
application performance is that the applications located closest to the mem-
ory controller access points will experience better performance compared to
applications allocated further away [70]. However, we show that applications
that reside close to the memory controller might be more severely aﬀected
by the interplay between local traﬃc and memory controller traﬃc.
Figure 8.1: CMP tile-based design with dynamic application domains
Figure 8.2: Basic switch architecture
To ease development, the tiles in a CMP are usually homogeneous, with a
structure as displayed in Figure 8.1. Every tile has a private level I cache
and (usually) a shared level II cache, together with the processing core and
a switch to access the on-chip network. Off-chip memory (DRAM) is accessed through
one or more memory controllers connected to the on-chip network, usually at
the edge of the chip, through one or more ports. The network on chip carries
cache coherency traﬃc between the level I and level II caches, and memory
access traﬃc to and from the memory controller. The two traﬃc types may
or may not be divided into two virtual networks (using virtual channels).
The interaction between these two traﬃc types is the core of the issue we
study in this chapter. Local traﬃc from one application (cache coherency
traﬃc) should not interfere with the local traﬃc from other applications,
as application isolation is a core concept of CMP virtualisation [59]. Most
applications will, however, be aﬀected by the memory controller traﬃc from
other applications. In this chapter we study how and to what extent the local
and memory controller traﬃc contribute to network congestion and how this
aﬀects application performance. Based on this study we present a mechanism
to reduce head-of-line blocking, and thus network congestion, both with and
without virtual channels. The structure of this chapter is as follows: Section 3
presents the NoC background and the related work. In Section 4 we describe
the congestion problem in on-chip networks and the causes behind it, and we
present our congestion control solution in Section 5. Next, in Section 6, we
detail the evaluation scenario and the results obtained, and ﬁnally in Section
7, we present some conclusions and future work.
3 NoC Background and Related work
There is significant ongoing research to study how application mapping and
basic properties of virtualisation are related to network performance. In [59],
the importance of traﬃc isolation and contiguous application mapping is
presented to demonstrate the foundations of virtualisation. Das et al. [70]
study application mapping mechanisms, and show that memory intensive
applications should be located close to the memory controller, but the authors
do not study the impact of application traﬃc and memory controller traﬃc
on the mapping. To simplify such mapping problems, Abts et al. [57] study
alternative memory controller placements by moving the memory controller
access points towards the centre of the chip. This breaks the regularity of the
chip, both in design and routing, so further study is required before these
strategies may realistically be employed. Finally, Sanchez et al. [64] describe
how diﬀerent NoC topologies can impact on application performance, but do
not consider the location of the application relative to other applications and
the memory controllers in the CMP.
There are a number of solutions in [65, 66, 67] that attempt to reduce the
negative effect of shared resources through quality of service (QoS) based on
priority schemes. Although all these solutions can alleviate network
congestion by prioritising different traffic types, their objective is to
differentiate the traffic and they do not focus on the congestion problem
itself. As a consequence, there may be congestion within each traffic class
for unpredictable traffic patterns.
A number of solutions for CMP NoCs are presented in [60, 61, 62, 63]. The
authors describe mechanisms that collect congestion information from the
neighbouring nodes through the routing process and buffer ingress/egress
monitoring. The idea is to offer an alternative path to route around a
congested area of the chip. However, this approach has a negative impact,
creating more congested resources, as it is impossible to avoid the congested
region if all the congested traffic has the same target (the memory
controller).
Van den Brand et al. [2] and Thottethodi et al. [19] rely on a central controller
to gather congestion information from the network. Whereas the former uses
a guaranteed service traﬃc class to propagate congestion notiﬁcations to the
sources, the latter uses a separate control network for this purpose. Neither
solution oﬀers great scalability because of the centralisation. There is still an
ongoing ﬁeld of research on many aspects related to congestion management
in NoCs for CMPs, but we have found that the basic congestion problem in
a virtualised CMP is not well understood. Therefore, our objective with this
chapter is to present a study on how congestion problems arise in the event
of many concurrent applications with shared resources (memory controllers).
This work serves as a motivation and guide in the search for cost-eﬀective
resource management solutions, and we present a solution to deal with the
congestion problems in these scenarios.
4 NoC Congestion
In this section we describe the concept of network on chip contention and how
this leads to network congestion. Furthermore, we examine the relationship
between the local traﬃc and memory controller traﬃc in order to determine
how this will aﬀect application performance based on its location on the chip
relative to the memory controller access points.
4.1 NoC Contention
Whenever a NoC packet enters a switch, it is buﬀered, and the header infor-
mation is read to determine the output port for the packet from the switch
(see Figure 8.2). The packet must then wait until it reaches the head of the
queue and the appropriate output port is available (i.e. not receiving a packet
from another port in the switch and not blocked by ﬂow control). NoCs em-
ploy ﬂow control to ensure that packets are never dropped by guaranteeing
that there is buﬀer space available in the next hop switch before forwarding
the packet across the link [68]. Since multiple packets in diﬀerent input ports
in the switch may have the same output port, a given packet might have to
wait for several scheduling rounds (output port contention) before it is al-
lowed access to the switch crossbar and can continue. During this time there
may be other packets in the same buﬀer with diﬀerent output ports that are
available. However, these packets cannot proceed because they are blocked
by the ﬁrst packet in the queue. This is known as head-of-line blocking.
Together with head-of-line blocking, the ﬂow control mechanism causes con-
gestion trees to build up in the network. Whenever a packet is blocked, it will
block packets upstream, gradually expanding the congestion tree branches
from the tree root through this back-pressure mechanism. The root is the
switch without enough capacity to forward all incoming packets (the place
where the packets are ﬁrst blocked). For the memory controller traﬃc, con-
gestion tree roots will typically be the cores that are the memory controller
access points. Output port contention and head-of-line blocking, combined
with the back-pressure caused by the flow control mechanism, lead to network
congestion at high network load, which has a significant impact on the
performance of the NoC.
4.2 Application Performance Relative to Memory Controller Location
When using virtualisation to support multiple concurrent applications on a
CMP, the two traﬃc types (local traﬃc and memory controller traﬃc) may
or may not be separated into two diﬀerent virtual networks using virtual
channels. If all the traffic runs on the same virtual channel (i.e. one
virtual channel), it is obvious that applications that suffer from congested
transit memory controller traffic will experience congestion in the local
traffic as well. However, using two virtual networks allows a separation of
the traffic, which reduces the interaction between the two traffic types.
With two virtual channels, each channel is typically guaranteed 50% of the
physical channel bandwidth. Consequently, as long as neither of the two
traffic types has a demand greater than 50% of the channel bandwidth, there
is no significant interaction between them. However, most applications have
a larger amount of local traffic than memory controller traffic. Thus,
applications that are located far away from the memory controller and see
little transit memory controller traffic are free to use more than 50% of
the bandwidth for local traffic. For the applications located closer to the
memory controller, the amount of transit memory controller traffic increases
drastically, which limits the effective local traffic to at most 50% of the
bandwidth and may introduce congestion problems for the local application
traffic. Consequently, applications located closer to the memory controller
will exhibit worse performance. This contradicts previous studies which
concluded that applications close to the memory controller had better
performance [69, 70], and we clearly see this effect in the evaluation
section.
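The bandwidth argument above can be made concrete with a small Python model. It is a simplified sketch under stated assumptions: link bandwidth is normalised to 1.0, each virtual channel is guaranteed half of it, and bandwidth left unused by one channel can be borrowed by the other. The function name and the example demands are hypothetical.

```python
def local_share(local_demand, mc_demand, guaranteed=0.5):
    """Fraction of the link bandwidth obtained by local traffic.

    local_demand, mc_demand: offered load of local and transit memory
    controller traffic, as fractions of the physical link bandwidth.
    guaranteed: share reserved for each of the two virtual channels.
    """
    # Memory controller traffic gets its demand, capped by its guarantee
    # plus whatever the local traffic leaves unused.
    mc_granted = min(mc_demand, max(guaranteed, 1.0 - local_demand))
    # Local traffic gets its demand, capped by what is left on the link.
    return min(local_demand, 1.0 - mc_granted)


# Application far from the memory controller: little transit traffic.
far = local_share(local_demand=0.7, mc_demand=0.1)
# Application close to the memory controller: heavy transit traffic.
near = local_share(local_demand=0.7, mc_demand=0.6)
print(far, near)
```

With the same local demand of 70%, the tile far from the memory controller is fully served, while the tile close to it is throttled to its guaranteed 50%.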
This discussion has shown that even though a large degree of traﬃc isolation
can be achieved using virtual channels, there is still interaction which can ad-
versely aﬀect application performance as we will see in the evaluation section
(Section 6). We will also see that not separating the traﬃc has even more
adverse eﬀects on application performance. Eﬃcient resource management
in terms of congestion control is therefore required, both to increase overall
eﬃciency of the chip and fairness between the running applications.
5 NoC Congestion Control
For congestion management, we propose HACS (Head-of-line Avoidance Con-
gestion Skip-ahead), a head-of-line blocking observation mechanism that al-
lows buﬀered packets to bypass the packet that is at the head of the queue.
The core mechanism is presented in Figure 8.3. Note that this mechanism
is supported under virtual cut-through packet switching. Whenever a packet
is stalled for a given time period at the head of a buffer, HACS searches
further back in the queue for the first packet that is routed to a free
output port (because it has a different destination) and lets it skip to
the head of the queue. This effectively reduces head-of-line blocking and,
as a result, congestion.
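The skip-ahead decision described above can be sketched at a purely behavioural level in Python. The stall timer is deliberately left out, and the function name, the port labels and the `depth` parameter (how many packets from the head may be considered) are assumptions of this sketch rather than details of the hardware.

```python
from collections import deque


def hacs_select(queue, free_ports, depth=2):
    """Return the index of the packet to forward next, or None.

    queue: deque of requested output ports, index 0 being the head.
    free_ports: set of output ports currently able to accept a packet.
    depth: number of packets, counted from the head, that may be served.
    """
    if not queue:
        return None
    head_port = queue[0]
    if head_port in free_ports:
        return 0                      # no head-of-line blocking
    for i in range(1, min(depth, len(queue))):
        # Skip ahead only to a packet with a different, free output port.
        if queue[i] != head_port and queue[i] in free_ports:
            return i
    return None                       # everything eligible is blocked


buffer = deque(["EAST", "NORTH", "EAST"])
print(hacs_select(buffer, free_ports={"NORTH"}))
```

Since a later packet is selected only when its output port is free and differs from that of the blocked head, packets requesting the same output port never overtake each other.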
We now discuss the implementation of HACS in xpipesLite [68]. All packets
are assumed to be 4 ﬂits long by padding shorter ones and by splitting longer
messages into multiple packets. The network guarantees in-order delivery
of packets headed to the same destination. An arbiter is instantiated for
each output port to perform round robin arbitration among all inputs with
valid asserted and presenting a head ﬂit. The switch implements the LBDR
mechanism [71].
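Round-robin arbitration among the inputs requesting an output port can be modelled as follows. This is a generic textbook sketch, not the xpipesLite arbiter; the class name and the choice of giving input 0 the highest priority after reset are assumptions.

```python
class RoundRobinArbiter:
    """Grants one requesting input per call, rotating the priority."""

    def __init__(self, n_inputs):
        self.n = n_inputs
        # Assumed reset state: input 0 has the highest priority first.
        self.last = n_inputs - 1

    def grant(self, requests):
        """requests: list of booleans (valid asserted with a head flit).

        Returns the index of the granted input, or None if nobody asks.
        """
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx   # the winner gets the lowest priority next
                return idx
        return None


arbiter = RoundRobinArbiter(5)
requests = [True, False, True, False, False]
grants = [arbiter.grant(requests) for _ in range(4)]
print(grants)
```

With inputs 0 and 2 requesting continuously, the grant alternates between them, so neither input is starved.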
Assume two packets "A" and "B" are stored in an input buffer (see Figure
8.3), and let the arbiter of the output port requested by "A" be stalled
(blocked), thus preventing packet forwarding. In the HACS switch, a timer
is activated upon snooping such a stall condition. If the stall signal is
deasserted during the count-down, the timer is reset until the next stall
occurs. If the stall is still high at the time-out, the control logic shifts
the read pointer to the head flit of the second packet. Before computing the
destination of "B", the LBDR routing logic saves the destination of "A" in
backup registers (Lbdr-out-A in the figure) for later comparison with the
target output of packet "B". An XOR comparator compares the two target
ports required by "A" and "B". If they are different and the port requested
by "B" is available, the stall goes down and "B" is forwarded. If not, the
read pointer shifts back to packet "A" until the stall is deasserted. If
the stall signal is removed while the target port comparison is being
performed, the valid signal is driven low to avoid sampling "B" before "A",
thus preserving the order on each output port. In this unfortunate and very
unlikely case, the switch experiences a one-cycle overhead before being able
to forward "A".
HACS can also be applied to each virtual layer of a network with virtual
channel support. As previously illustrated, 2 VCs may be considered for
memory controller traffic separation. In practical terms, we followed the
strategy proposed in [72]. Essentially, the HACS switch is replicated twice,
with a demux placed in front of each input port and a mux with associated
arbiter after each output port. The link is enhanced with a virtual channel
identifier and with a flow control signal for each virtual channel. This is
done to exploit logic synthesis optimisations for the sake of area efficiency.
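The timer-and-compare sequence described above for packets "A" and "B" can be approximated by the following cycle-level Python sketch. It is a simplification under stated assumptions: the class and argument names are hypothetical, the time-out value is arbitrary, only one skip attempt is made per stall episode, and the one-cycle corner case (stall removed during the comparison) is not modelled.

```python
class HacsTimer:
    """Decides, cycle by cycle, whether packet 'A' or 'B' is forwarded."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.count = 0
        self.checked = False   # True once 'B' was examined for this stall

    def step(self, stall_a, port_a, port_b, port_b_free):
        """Advance one clock cycle and return 'A', 'B' or None."""
        if not stall_a:
            # Stall removed: reset the timer and forward the head packet.
            self.count = 0
            self.checked = False
            return "A"
        if self.checked:
            return None        # read pointer is back on 'A', keep waiting
        self.count += 1
        if self.count < self.timeout:
            return None        # still counting down
        self.checked = True
        # Time-out: compare the saved target of 'A' with the target of 'B'.
        if port_b is not None and port_b != port_a and port_b_free:
            return "B"
        return None


timer = HacsTimer(timeout=3)
trace = [timer.step(True, "EAST", "NORTH", True) for _ in range(4)]
trace.append(timer.step(False, "EAST", None, False))
print(trace)
```

In the trace, "B" is forwarded at the time-out because it targets a different and available port, and "A" follows as soon as its stall is removed.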
6 Evaluation
In this section we first describe the simulation environment we used for the
evaluations, followed by a discussion of the results obtained, including the
results obtained with HACS and its implementation costs after synthesis.
Figure 8.3: Switch architecture
6.1 System conﬁguration
Our simulation framework is a combination of tools chosen to simulate a
CMP system as closely as possible. Multi2sim [73] is a simulation framework
for heterogeneous computing that allows one or more applications to run
on top of it in CMP-like scenarios. It is able to model a complete memory
hierarchy system integrated into the CMP and its connection to the
respective processor cores. We combined Multi2sim with a cycle-accurate
flit-level network-on-chip simulator called gNoCsim (developed by
Universidad Politecnica de Valencia, and used in the NaNoC project [74] by
different partners). gNoCsim is able to simulate the network between all the
resources in the chip: caches, memory controllers, and processor cores. For
the evaluation process, we modelled a CMP that resembles current chip
configurations, such as the one in Figure 8.1. This configuration implements
a tile-based system, and
each tile is composed of a processor core, a private L1 cache, a bank of the
shared L2 cache, a memory directory bank to be used with the directory-based
MOESI cache coherency protocol, and different configurations of memory
controllers. Each memory controller is connected to the main memory through
2 channels (each memory controller has two access points). A detailed
overview of the chip configuration is shown in Table 8.1.
Parameter            Configuration
Core                 x86
L1 cache             16 KBytes Instructions
                     16 KBytes Data
                     Total 32 KBytes per core
                     2 cycles latency
                     2-way associativity
                     64 bytes block size
L2 cache             256 KBytes per core
                     20 cycles latency
                     4-way associativity
                     64 bytes block size
Main memory          1 GByte total
                     200 cycles latency
Coherence protocol   MOESI CMP, directory-based
Topology             10 × 10 2-D mesh
Routing mechanism    LBDR + SR
Packet switching     Virtual cut-through (VCTlite) [75]
Buffer queue size    12 flits
Flit size            8 bytes
Table 8.1: CMP configuration.
A 10 × 10 2-D regular mesh topology was used for the CMP system. The
LBDR [71] mechanism was used for routing, allowing for routing-contained
application domains in combination with the Segment-Based Routing algorithm
(SR) [76]. Virtual networks are used for different levels of traffic of the
memory hierarchy system, implemented as multiple virtual channels (a total
of two virtual channels are used), except for Figure 8.8, where no virtual
channels are used.
For the evaluations we used a collection of applications from the SPLASH-2
benchmark with the default parameters defined in [77]. The applications are
statically mapped to the chip when the experiment is set up. Applications are
mapped to completely fill the chip, giving a fair share of cores to each
application. Every batch consists of a single application type from the
benchmark suite rather than being composed of a collection of mixed
applications. This regularity makes it significantly easier to generate
relevant statistics and spot trends in the results, for example by averaging
results for the execution time comparison. Running a mix of applications will
introduce spikes in the communication, but these will be evened out by the
number of applications over time, so the conclusions will still be the same.
See Figure 8.4 for an example of the mapping of 32 concurrent applications
with 4 memory controllers.
Figure 8.4: 32 concurrent applications mapped on the system
6.2 Results
We have evaluated several combinations of number of concurrent applications
and memory controllers for a 10 × 10 mesh. Speciﬁcally, we have evaluated
6, 8, and 12 applications with one memory controller, 12 and 16 applica-
tions with two memory controllers, and 32 applications with four memory
controllers. Due to space constraints we report the results for 12 applica-
tions with one memory controller and 32 applications with four memory
controllers. The general trend from the results is that network performance
decreases and unfairness (the diﬀerence in runtime based on application lo-
cation, with two virtual channels) increases as the number of concurrent
applications increases for a given number of memory controllers.
We have plotted the execution time distribution for 12 applications (ocean
workload) with a single memory controller both with (Figure 8.7) and without
(Figure 8.8) virtual channels. The memory controller is located in the
uppermost corner. The figure with virtual channels clearly shows how the
threads that are located closer to the memory controller have a longer
execution time (as much as 7.5% longer than when running alone) than the
threads located farther away. The picture is more chaotic without the use of
virtual channels. There is no clear unfairness; however, the overall increase
in execution time (8.1%) is larger than with virtual channels.
In Figure 8.5 we display the mean squared error between injected and ac-
cepted traﬃc for diﬀerent applications in the scenario with 32 concurrent ap-
plications with 4 memory controllers, with and without HACS implemented.
In the figure, HACS2 is allowed to skip ahead to the second packet, while HACS3
may skip ahead to the second or third packet. The figure shows the impact of
performance degradation due to congestion on the different applications.
HACS2 and HACS3 reduce the penalty to only 2.5% on average. Note that
there is negligible difference between HACS2 and HACS3. The performance
degradation is significantly worse with fewer memory controllers.
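As an illustration of the metric plotted in Figure 8.5, the following minimal sketch computes the mean squared error between an injected and an accepted traffic trace. The function name, the sampling in fixed time windows and the sample values are assumptions made for this example only; they are not taken from the experiments.

```python
def traffic_mse(injected, accepted):
    """Mean squared error between two equally long traffic traces."""
    if len(injected) != len(accepted):
        raise ValueError("traces must have the same number of samples")
    if not injected:
        raise ValueError("traces must not be empty")
    squared = [(i - a) ** 2 for i, a in zip(injected, accepted)]
    return sum(squared) / len(squared)


# Hypothetical samples in flits/cycle/NIC, one per time window (not measured data).
injected = [0.50, 0.60, 0.70, 0.60]
accepted = [0.50, 0.50, 0.50, 0.50]   # the network saturates at 0.5

print(f"MSE = {traffic_mse(injected, accepted):.4f}")
```

A value of zero means that the network accepts all the traffic the application injects; the larger the value, the more the application is throttled by congestion.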
Figure 8.6 shows network throughput as a function of time for the ocean
workload. The uppermost plot is the injected traﬃc, and the bottom plot is
the accepted traﬃc without congestion control, a clear case of a congested
network. HACS2 and HACS3 solutions almost remove all the congestion,
handling almost all the injected traﬃc. The second to bottom line is HACS
without virtual channels. This still increases averaged network throughput
by around 40%, although the result is poorer than with virtual channels. The
designer has to assess the trade-oﬀ between performance and implementation
costs.
Summarising, the objective of these evaluation cases was to reproduce scenarios
that reflect current chip configurations and realistically illustrate
the effect of multiple simultaneous applications. The cost/performance trade-off
depends on how many resources are available (in our case, the number of
memory controllers), and there is a need for congestion management strategies
that can alleviate the problem with minimal impact on the design of
the chip. In the next section we evaluate the cost of the congestion control
mechanism we have developed.
Figure 8.5: 32 concurrent applications mapped on the system (MSE between
injected and accepted traffic)
6.3 Hardware breakdown
This subsection characterises area and critical path delay overhead of the
HACS switch compared to a baseline one taken from the xpipesLite NoC
library [68]. The reference switch implements input buﬀering, stall/go ﬂow
control and virtual cut-through switching. Baseline and HACS switches were
synthesised for a target speed of 500 MHz in a 65nm industrial technology
library. Normalised post-synthesis area results are illustrated in Figure 8.9.
The area of the HACS switch without virtual channels is about 5.34% larger
compared to the baseline switch. The implementation with virtual channels
results in 2.09x the area of the virtual channel-less switch. Observe in the
[Plot: average network throughput (flits/cycle/nic), 0 to 1, versus time (cycles x1000), 0 to 300; series: oceaninjected32.pt, oceanaccepted32.pt]
Figure 8.6: 32 concurrent applications mapped on the system (Averaged net-
work throughput, ocean workload)
ﬁgure that the baseline input buﬀer features approximately the same area
as the input buffer with the new logic to shift the read and write pointers,
which indicates the marginal impact of the control logic. On the other hand, most
of the area overhead (about 4.09%) is due to the timer inserted in the switch.
This could be improved in future solutions by using buffer thresholds
instead of a timer. After synthesis, the critical path of the new switch without
VCs was found to be degraded by less than 1% with respect to the baseline
one. The virtual channel implementation contributes an additional 3% of
critical path degradation, associated with the arbiters in the switch output
ports that select which virtual channel to move forward.
7 Conclusions
We have studied the eﬀects of shared memory access in a CMP with multi-
ple concurrent applications. It is often assumed that network congestion is
Figure 8.7: Execution time distribution, 1 memory controller, ocean workload
(With virtual channels)
not an issue for CMP systems because of the abundant bandwidth in the
network on chip. Our evaluations show that network congestion may indeed
be a problem when multiple applications access shared memory through the
memory controllers available on a typical CMP, and we developed a simple
solution, HACS, to remove this congestion. Our experimental results show two
effects. First, the hotspots formed by the memory controller traffic lead to
network congestion, which can degrade the performance of applications by
15% on average in our scenarios. Second, if there
is a high degree of local traﬃc (which is often the case), the applications
allocated close to the memory controller will have less bandwidth for local
traﬃc than applications located further away. The applications closest to the
memory controllers are therefore penalised and have longer execution times
compared to the others. Further work includes evaluating a wide variety of
network controller conﬁgurations and workloads.
Figure 8.8: Execution time distribution, 1 memory controller, ocean workload
(No virtual channels)
Figure 8.9: Switch area at 500 MHz
Chapter 9
Final FPGA Prototyping of
Homogeneous Multicores.
1 Abstract
This chapter reports on the prototyping, on a Xilinx Virtex-7 FPGA, of the
design methods presented in the previous chapters. Boot-time testing and
configuration, runtime detection of faults, runtime reconfiguration of the routing
function, and dynamic virtualization of the interconnect fabric are validated
on the FPGA prototype, where a 4x4 multi-core system has been implemented
and managed. This advanced form of platform control is achieved
via hardware/software co-design and co-optimization.
2 Introduction
NoC design principles have recently reached a stage where they start to sta-
bilize, in correspondence to their industrial uptake. A key property that novel
NoCs cannot miss is to guarantee a potentially fast path to industry, since
NoC deployment is today a reality. An important requirement for this purpose
is the efficient testability of candidate NoC architectures. This property
is very challenging due to the distributed nature of NoCs and to the difficult
controllability and observability of their internal components. When we
also consider the pin count limitations of current chips, we conclude that NoCs
will most probably be tested in the future via built-in self-testing (BIST)
strategies.
Finally, there is an increasing need in embedded systems for implementing
multiple functionalities upon a single shared computing platform. The main
motivation for this is the set of constraints on system size, power consumption
and/or weight. This chapter reports on the first-time prototyping of
a Network-on-Chip capable of supporting all of the advanced features described
above, and serves as a proof of validation of the industry-ready
NoC. The presented prototype builds on the first switch variant mentioned
in chapter 1. It validates the (re-)configuration capabilities that preserve
safe network operation in the presence of wanted (e.g., virtualization)
and unwanted (e.g., manufacturing defects, intermittent faults) effects. The
prototyping platform is the Xilinx Virtex-7 evaluation board
named VC707, described in section 3. The prototyped system implemented
inside the FPGA is a homogeneous multicore processor, which resembles
programmable hardware accelerators of hierarchical, high-end embedded systems,
or basic computation clusters of many-core processors. The validated
design methods include:
1. boot-time testing and diagnosis of the 4x4 2D mesh NoC, targeting
permanent faults;
2. switch-level and network-level fault-tolerance, targeting transient faults
and intermittent faults (i.e., those faults that rapidly anticipate the
breakdown of links or switch components);
3. runtime reconfiguration of the network routing function, with logic-based
distributed routing as the underlying routing mechanism. The
validated reconfiguration procedures are twofold: at boot-time, without
background traffic, and at runtime, with background traffic;
4. dynamic virtualization, i.e., partitioning of the whole NoC into isolated
partitions running different applications.
As such, this chapter validates:
5. the design methods for supporting NoC static irregularities (routing
methods, testing and diagnosis methods);
6. the design methods for supporting dynamically virtualized NoCs (error
detection and signaling mechanisms, runtime reconfiguration methods,
virtualization methodology);
7. methodologies for seamless integration of NoC topologies within IP
cores.
Figure 9.1: VC707 baseline prototyping board.
3 FPGA Platform
The target system to prototype is highly complex, hence calling for a high-end
FPGA and development board so as not to incur integration capacity limits.
The Virtex-7 FPGA VC707 Evaluation Kit was selected for this purpose. It is a
full-featured, highly flexible, high-speed serial base platform using the Virtex-7
XC7VX485T-2FFG1761C, and includes basic components of hardware, design
tools, IP, and pre-verified reference designs for system designs that demand
high-performance serial connectivity and advanced memory interfacing. The
included pre-verified reference designs and industry-standard FPGA Mezzanine
Connectors (FMC) allow scaling and customization with daughter cards.
The XC7VX485T FPGA features 485760 logic cells, 75900 CLB slices, 2800
DSP slices, 37080 kb of block RAM, 14 total I/O banks and a maximum of 700
user I/Os. The key features of the evaluation board (see Figure 9.1) are as follows:
. Virtex-7 FPGA VC707 Evaluation Kit: RoHS compliant VC707 kit including the
XC7VX485T-2FFG1761 FPGA.
. Configuration: Onboard JTAG configuration circuitry to enable configuration
over USB, JTAG header provided for use with Xilinx download cables
such as the Platform Cable USB II, 128MB (1024Mb) Linear BPI Flash for
PCIe Configuration, 16MB (128Mb) Quad SPI Flash.
. Memory: 1GB DDR3 SODIMM 800MHz / 1600Mbps, 128MB (1024Mb)
Linear BPI Flash for PCIe Configuration, SD Card Slot, 8Kb IIC EEPROM.
. Communication and Networking: GigE Ethernet RGMII/GMII, SGMII, SFP+
transceiver connector, GTX port (TX, RX) with four SMA connectors, UART
to USB Bridge, PCI Express x8 Gen2 Edge Connector (layout for Gen3).
. Display: HDMI Video OUT, 2x16 LCD display, 8X LEDs.
. Expansion Connectors: FMC1 - HPC (8 XCVR, 160 single ended or 80
differential user-defined pins), FMC2 - HPC (8 XCVR, 116 single ended or
58 differential user-defined pins), Vadj supports 1.8V, IIC.
. Clocking: Fixed oscillator with differential 200MHz output used as the system
clock for the FPGA, programmable oscillator with 156.250 MHz as the
default output (default frequency targeted for Ethernet applications, but the
oscillator is programmable for many end uses), differential SMA clock input,
differential SMA GTX reference clock input, jitter attenuated clock used
to support CPRI/OBSAI applications that perform clock recovery from a
user-supplied SFP/SFP+ module.
. Control and I/O: 5X Push Buttons, 8X DIP Switches, Rotary Encoder
Switch (3 I/O), AMS FAN Header (2 I/O).
. Power: 12V wall adapter or ATX, Voltage and Current measurement capability.
. Debug and Analog Input: 8 GPIO Header, 9 pin removable LCD, Analog
Mixed Signal (AMS) Port.
Figure 9.2: FPGA platform overview
4 The System Under Test
The high-level view of the design can be found in Figure 9.2. The system
comprises a large number of components within the FPGA. As can be seen
on the left side of the diagram, a relatively standard Xilinx subsystem is
instantiated ﬁrst; this comprises an AXI interconnect linking together a Mi-
croBlaze (to run the supervision software), a small memory and an external
DRAM controller, and several peripheral controllers required to run software
on the MicroBlaze and to communicate with a laptop. The right side of the
diagram depicts the designed components. This part of the system is the
"Device Under Test" (DUT) of the platform, whose functionality is to be
verified. It mainly comprises:
. The main NoC, built as a 4x4 mesh of the ﬁrst variant switches.
. The dual NoC, built as a chain that follows the topology of the main NoC.
The dual NoC is in charge of conﬁguring the main NoC and of collecting
status information (e.g. fault detections) from the main NoC.
. At each node of the main NoC (see also Figure 9.3), a MicroBlaze and
a memory (by Xilinx) are connected to the switch by means of Network
Interfaces.
. Two special blocks, based on AXI NIs, have been designed to connect the
dual NoC to the supervision subsystem. These blocks allow the supervision
MicroBlaze to receive notiﬁcations by the dual NoC, and to reprogram it.
. A sniffer module monitors traffic along all links of the main NoC mesh,
computing link utilization. It is designed so that the supervision subsystem
can probe it at regular intervals and transfer its contents to the user's
laptop.
. A fault injection module has been instantiated along a mesh link. This simple
module, connected to a physical button on the FPGA board, provides
a method to inject faults on that link to test the platform's fault tolerance
and the NoC reconfiguration capability.
To build this platform, we proceed in
steps (Figure 9.4). First, we instantiate within Xilinx Platform Studio (XPS)
a complete design comprising the whole supervision subsystem, the 16 additional
MicroBlazes, and the corresponding 16 memories. At this stage, no NoC is
instantiated yet. Using XPS for this task allows us to efficiently connect and
configure all the Xilinx blocks, and facilitates the instantiation of the top-level
HDL files. Additionally, this makes it possible to subsequently load the applications
into all 17 MicroBlazes' memories, and to debug those processors
step-by-step, directly through the Xilinx toolchain, which is Eclipse-based.
After the first pass of synthesis, however, we remove from the design the Xilinx
AXI subsystem which connects the 16 additional MicroBlazes and
memories, and swap in the NoC (main and dual) in its place. We then proceed
to finish the implementation flow within Xilinx ISE by performing mapping,
placement and routing, and generating the final bitstream.
We leverage some key features of the Virtex-7 board, apart from the FPGA
chip. The on-board
DRAM is used to provide suﬃcient space for the software running on the
supervision MicroBlaze to work. Physical buttons and switches of the board
are connected to an on-chip GPIO controller to allow the user to interact
with the platform. Finally, a laptop can be connected to the board by means
of two cables to monitor the platform’s operation; one cable carries serial
port signals (piggybacked onto a USB port) and the other carries JTAG sig-
nals (also piggybacked onto a USB port). The former is used to read the
board’s outputs, while the latter allows for programming the board and in-
teractively debugging the on-FPGA MicroBlazes. Custom-written software
runs in three locations of the system: on the supervision MicroBlaze, on the
16 MicroBlazes connected to the main NoC, and on the external laptop.
. The software on the supervision MicroBlaze is tasked with oversight of the
main NoC and dual NoC, with regular polling of the Traffic Sniffers, and
with interfacing with the external world through the serial interface.
. The 16 MicroBlazes connected to the mesh run micro-benchmarks. These
micro-benchmarks have the main role of generating traﬃc on the mesh, so
that the various platform features can be tested. Real functional behaviour
was implemented: the nodes perform pipelined matrix multiplications, ex-
changing data in producer-consumer fashion. More advanced applications
could not be implemented due to the lack of I/O interfaces on these nodes
and due to lack of memory to instantiate a full C library.
. The user’s laptop is connected to the board through a JTAG-over-USB
cable and a serial-over-USB cable. The former can be leveraged mainly by the
Xilinx toolchain, allowing for board programming and debugging. The latter
is monitored by a GUI that displays in real-time the platform status and
link utilization. This GUI allows the user to analyze the impact of running
diﬀerent software on the system and the behaviour upon fault injection or
virtualization implementation.
5 Basic components: the on-chip network
A 4x4 mesh with one core and one memory per switch has been chosen
as the target on-chip network of the FPGA platform. In particular, Figure 9.3
represents the basic components instantiated to realize the 4x4 mesh. A Mi-
croBlaze and a memory are connected to each switch through two Network
Interfaces. Finally, a sniﬀer is placed on each bidirectional network link to
monitor the network traﬃc. The sniﬀers collect information about the traﬃc
crossing the switch-to-switch and NI-to-switch links and deliver such infor-
mation to the global manager (i.e., the supervision MicroBlaze). Both the
NIs and the switches have been designed ad-hoc to support the target on-
chip network where fault-tolerance, testing capability and reconﬁgurability
features are guaranteed. Note that the MicroBlaze also includes a directly-
connected BRAM of 128kB (not shown in the ﬁgure) to store its application
software.
5.1 The Network Interfaces
We instantiate two types of Network Interfaces (NIs): an AXI initiator NI
to interface with the MicroBlaze, and an AHB target NI to interface with
the memory. This choice was deliberate (both could have been AXI, for instance)
to demonstrate interoperability between the two. Due to the relatively simple
needs of the MicroBlaze core, which does not support multiple transaction
IDs, we save area by instantiating a small AXI initiator NI with support for
only one such ID. However, the NI still supports all AXI features. Both
AXI and AHB NIs, and their interoperability, were extensively tested in RTL
and on the FPGA. For integration into the platform, a few tweaks to the NIs
were needed:
. NIs embed routing tables to statically perform source routing. In this platform,
routing is distributed and reconfigurable to work around faults or to
enforce virtualization. Therefore, the routing tables are modified to instead
encode the XY coordinates of the destination core; these are processed
at the switches. The coordinates are expressed as strings of 9 bits: 4 bits
for each coordinate (slightly overprovisioned for a 4x4 mesh) plus one bit
to differentiate between the two local cores at each node, i.e. MicroBlaze and
memory.
. The input and output buﬀers of the NIs are extended to support the NACK-
GO ﬂow control protocol used by their switches, instead of STALL-GO.
. The AXI initiator NIs are extended with two extra pins, directly connected
to FPGA pads, in turn connected to physical switches of the FPGA board.
This means that the user, by manually flipping those switches, can change the
value of two bits inside each NI. The NI in turn exposes these two bits to the
MicroBlaze at the reserved address 0x11000000. The MicroBlaze can poll this
location to switch between operating modes, e.g. staying idle, or executing one
of multiple pre-programmed applications. As can be inferred from Figure 3,
the 16 MicroBlazes attached to the mesh have no
way to communicate with the external world except through this facility.
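The 9-bit destination field described in the first item above can be sketched as follows. The text only fixes the field widths (4 bits per coordinate plus one local-core bit); the bit ordering, the value assigned to each local core and the function names are assumptions made for illustration.

```python
COORD_BITS = 4                 # bits per coordinate, as stated in the text
LOCAL_MB, LOCAL_MEM = 0, 1     # assumed meaning of the local-core bit


def encode_dest(x, y, local):
    """Pack (x, y, local) into a 9-bit destination field.

    Assumed layout: bits [8:5] = x, bits [4:1] = y, bit [0] = local core.
    """
    limit = 1 << COORD_BITS
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("coordinate does not fit in 4 bits")
    if local not in (LOCAL_MB, LOCAL_MEM):
        raise ValueError("local-core selector must be 0 or 1")
    return (x << (COORD_BITS + 1)) | (y << 1) | local


def decode_dest(field):
    """Unpack a 9-bit destination field into (x, y, local)."""
    mask = (1 << COORD_BITS) - 1
    return (field >> (COORD_BITS + 1)) & mask, (field >> 1) & mask, field & 1


# Memory attached to the switch at column 3, row 2 of the mesh.
field = encode_dest(3, 2, LOCAL_MEM)
print(format(field, "09b"), decode_dest(field))
```

With 4 bits per coordinate the same encoding would cover meshes of up to 16x16 nodes, which is why the text calls it slightly overprovisioned for a 4x4 mesh.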
6 Basic components: the supervision subsystem
In order to demonstrate the NoC functionality, a supervision subsystem is
required. For convenience, we choose to instantiate it within Xilinx Platform
Studio, using as many Xilinx IP cores as possible; we integrate it with
custom-designed IP where suitable. The subsystem (see Figure 3) includes a
Xilinx AXI interconnect, with two masters (a MicroBlaze and the Dual NoC
Receiver) and numerous AXI, AXI Lite and AHB slaves. At the heart of this
subsystem is a MicroBlaze running software. This software is tasked with:
. Probing the status of the NoC, e.g. after BIST and upon fault occurrences.
. In response to the above, configuring or reconfiguring the NoC.
. Awaiting possible user requests to reconfigure the NoC in a virtualized
manner.
. In response to the above, reconﬁguring the NoC.
. Polling the link sniﬀers periodically to monitor activity on the NoC links.
. Transferring key information about the platform’s functioning outside the
FPGA through the serial port (or, potentially, an Ethernet port).
To perform these actions, multiple support controllers and devices are needed.
First of all, since the supervision software and the required underlying C
library have a non-negligible footprint that exceeds the available on-chip
memory, a DRAM controller is needed to store the software. To support
the basic functionality of the Xilinx C library, a timer and an interrupt
controller must also be present. (Note that, in contrast, the 16 MicroBlazes
connected to the NoC mesh do not have access to external memory, a timer or
an interrupt controller; this limits the capabilities of the software that can be
run on them.)
In order to monitor the NoC, the MicroBlaze must be able
to access the dual NoC. This is done via three components plugged into the
AXI bus: a Dual NoC Driver, a Dual NoC Receiver, and a memory. The
MicroBlaze can directly write to the Dual NoC Driver, which is a slave on
the AXI bus, to program the main NoC. Due to the way the dual NoC
was designed, the reverse operation cannot be done with a read from the same
device; instead, whenever there is a message requiring attention (e.g., upon
BIST completion or fault detection), the dual NoC sends a packet to the
Dual NoC Receiver, which converts it into an AXI transaction directed at
the on-bus memory (a standard Xilinx core). The MicroBlaze can periodically
poll this memory to check all notifications.
To supervise the NoC activity, the
MicroBlaze can also poll the Traffic Sniffers. Each of these blocks can be connected
to up to 16 links of the main NoC on one side, and to the AXI bus on
the other. For maximum thoroughness, we choose to monitor as many as
80 links of the NoC (almost all, disregarding just a few whose information is
redundant), with five Traffic Sniffers in parallel. The sniffers include a counter
that is incremented at the passage of any flit; whenever the counter is read
by the MicroBlaze, it automatically resets itself. A simple division yields a
utilization metric.
Finally, the FPGA needs external interfaces. First of all,
a GPIO controller allows the MicroBlaze to periodically check the status of a
few physical buttons and switches on the FPGA board. This allows the user
to change operating modes of the platform; for example, we use this feature to
instruct the software on the MicroBlaze to initiate the reconfiguration to get
into virtualized application mode. Two extra blocks are used to communicate
with the user's computer. A UART controller is an output-only interface
that allows the platform to transfer information to the laptop, where the
GUI by UPV can visualize it. A debug module, relying on a JTAG-over-USB
electrical connection, allows for bidirectional communication: the user can
program the supervision MicroBlaze, step through its software, and check
the content of certain on-FPGA registers and memories. Since the debug
module allows for monitoring of up to 8 MicroBlazes, we connect it to the
supervision MicroBlaze and to 7 selected MicroBlazes out of the 16
attached to the main NoC mesh.
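The read-and-reset behaviour of the sniffer counters and the division that yields link utilization can be modelled in a few lines. This is a behavioural sketch only: the class and function names are hypothetical, and it assumes that the divisor is the number of clock cycles elapsed since the previous read and that at most one flit crosses a link per cycle.

```python
LINKS_PER_SNIFFER = 16   # from the text: up to 16 links per Traffic Sniffer
NUM_SNIFFERS = 5         # from the text: five sniffers, 80 links in total


class LinkCounter:
    """Model of one sniffer counter: counts flits, clears itself when read."""

    def __init__(self):
        self._flits = 0

    def on_flit(self):
        self._flits += 1

    def read(self):
        value, self._flits = self._flits, 0   # read-and-reset
        return value


def utilization(flits, cycles_elapsed):
    """Fraction of the polling interval during which the link carried a flit."""
    if cycles_elapsed <= 0:
        raise ValueError("polling interval must be positive")
    return flits / cycles_elapsed


sniffers = [[LinkCounter() for _ in range(LINKS_PER_SNIFFER)]
            for _ in range(NUM_SNIFFERS)]

# Hypothetical activity: 250 flits cross one link during a 1000-cycle interval.
link = sniffers[0][3]
for _ in range(250):
    link.on_flit()

first_poll = utilization(link.read(), 1000)    # counter is cleared by the read
second_poll = utilization(link.read(), 1000)   # idle interval
print(first_poll, second_poll)
```

Because the counter clears itself on every read, the supervision software does not need to keep the previous count; it only needs to know how long the polling interval was.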
7 Basic components: the reconfiguration algorithm
The supervision MicroBlaze constantly monitors the status of the NoC
through the dual NoC. Whenever a notification is received about a fault on
a link, and if deemed necessary (e.g. unless the fault is assumed to be transient), the
supervisor triggers the reconfiguration algorithm. This algorithm computes
the required changes in the LBDR (logic-based distributed routing) bits at
specific switches in order to migrate
from the current routing algorithm to a new one that avoids the use
of the notified link. Those bits are encoded and transmitted through the
dual NoC, together with a notification that triggers the switches to launch the
reconfiguration process.
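To give a flavour of the kind of per-switch changes involved, the sketch below handles only the simplest part of the problem: clearing the connectivity bits at the two switches attached to a failed link. It is not the reconfiguration algorithm of the platform, which must also recompute the routing bits so that the new routes remain deadlock-free. The row-major switch numbering, the port names and all function names are assumptions made for this example.

```python
WIDTH = HEIGHT = 4
STEP = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def neighbour(switch, port):
    """Id of the switch reached through `port`, or None at the mesh border."""
    x, y = switch % WIDTH, switch // WIDTH        # assumed row-major numbering
    dx, dy = STEP[port]
    nx, ny = x + dx, y + dy
    if 0 <= nx < WIDTH and 0 <= ny < HEIGHT:
        return ny * WIDTH + nx
    return None


def initial_connectivity():
    """One connectivity bit per port: True if a neighbour is attached."""
    return {s: {p: neighbour(s, p) is not None for p in STEP}
            for s in range(WIDTH * HEIGHT)}


def disable_link(conn, switch, port):
    """Clear the connectivity bits on both sides of a failed link.

    Returns only the bits that changed, i.e. what would have to be sent
    to the affected switches.
    """
    other = neighbour(switch, port)
    if other is None:
        raise ValueError("no link on that port")
    updates = {switch: {port: False}, other: {OPPOSITE[port]: False}}
    for s, bits in updates.items():
        conn[s].update(bits)
    return updates


conn = initial_connectivity()
print(disable_link(conn, 10, "E"))   # link between switches 10 and 11
```

Returning only the changed bits mirrors the idea stated above: the supervisor sends updates to specific switches rather than reprogramming the whole network.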
8 The application
The MicroBlazes have been programmed in order to start their application
after the 4x4 mesh is conﬁgured, upon ﬂipping a physical switch on the board.
The application run by the MicroBlazes is a matrix multiplication consisting
of the product of a pair of matrices. The MicroBlazes sequentially forward
the results to each other in a pipelined producer-consumer fashion. Each
MicroBlaze performs the multiplication of a private matrix and a matrix de-
livered by the previous MicroBlaze of the sequence. Once the matrix product
is computed the resulting matrix is forwarded to the next MicroBlaze. The
lack of I/O interfaces and memory does not allow the implementation of
more advanced applications. The private matrix (mat private) is stored by
each Microblaze into its local (pertaining to its local switch) AHB scratch
memory of 4kB.
The AHB memory is used as storage and for inter-processor communication
in the application. Indeed the incoming matrix from the previous Microblaze
(mat input) is stored in the AHB memory connected to the local switch.
Each MicroBlaze stores its matrix product result (mat output) into the AHB
memory connected to the switch to which the next MicroBlaze of the sequence
is connected. The ﬁrst MicroBlaze of the pipeline initializes its own local
AHB memory before performing the matrix product. Each MicroBlaze has
local registers storing the address of the local AHB memory, the address
of the remote AHB memory of the next MicroBlaze, the position within the
pipeline, and the pipeline length.
In order to guarantee synchronization between the MicroBlazes, custom
semaphores are implemented. Interestingly, these are implemented purely in
software and do not need dedicated hardware support. Such a solution slightly increases
the complexity of the code but clearly reduces the hardware design effort
and the area overhead. Of course, this approach is possible only because the
application is fixed and known upfront; more sophisticated synchronization
capabilities would demand hardware-level atomicity support. The goal of
these semaphores is to avoid reading the same incoming matrix multiple
times, and to avoid overwriting output matrices before the next MicroBlaze
has been able to process them. Each MicroBlaze has been provided with 4
semaphores. Notice that the semaphores to be polled have been placed in the
local AHB memory in order to reduce congestion in the network.
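The producer-consumer scheme described above can be illustrated with a sequential model of the pipeline. This is a simplified sketch, not the software running on the MicroBlazes: it uses two flags per node instead of the four semaphores mentioned in the text, assumes that the private matrix is the left operand of the product, and all names are hypothetical. Each flag is polled by the node that owns it and written by its neighbour, in line with the choice of keeping polled semaphores in local memory.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


class Node:
    """One pipeline stage with its local buffer and two software flags."""

    def __init__(self, mat_private):
        self.mat_private = mat_private
        self.mat_input = None
        self.input_full = False    # set by the previous node, polled locally
        self.output_free = True    # set by the next node, polled locally


def step(nodes, i):
    """Let stage i progress if it can; return the final result or None."""
    node = nodes[i]
    last = i == len(nodes) - 1
    if not node.input_full:
        return None                          # nothing new to consume
    if not last and not node.output_free:
        return None                          # next node has not consumed yet
    mat_output = matmul(node.mat_private, node.mat_input)
    node.input_full = False                  # never read the same input twice
    if i > 0:
        nodes[i - 1].output_free = True      # remote write: buffer is free again
    if last:
        return mat_output
    nodes[i + 1].mat_input = mat_output      # remote write into the next memory
    nodes[i + 1].input_full = True
    node.output_free = False                 # do not overwrite until consumed
    return None


identity = [[1, 0], [0, 1]]
double = [[2, 0], [0, 2]]
nodes = [Node(double), Node(double), Node(double)]
nodes[0].mat_input, nodes[0].input_full = identity, True   # first stage input

result = None
while result is None:
    for i in range(len(nodes)):
        out = step(nodes, i)
        if out is not None:
            result = out
print(result)
```

On the real platform the flag updates are ordinary memory writes carried over the NoC; the scheme needs no atomic operations because each flag has exactly one writer for each transition.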
9 The physical platform implementation
Some steps of the implementation ﬂow described in Figure 9.4 can be par-
allelized; for example, the initial platform description involves several blocks
which can be independently synthesized in parallel. Even after joining all the
pieces together, the mapping stage can be run on two threads in the Xilinx
toolchain, and the placement and routing in four.
The platform fills the FPGA almost completely, as can be seen in Table 9.1.
The left column reports the utilization when the template system generated
in XPS is implemented; the right one is for the same system where the NoC,
dual NoC and associated designed IP (e.g. sniffers, dual NoC interface blocks,
etc.) have been instantiated to replace the simple AXI interconnect. It can be
seen that the NoC represents approximately 17% of the FPGA's sequential
resources and 72% of the combinational resources (or slightly more, since
this is the overhead on top of the default bus); it does not occupy any BRAM
nor require external pins. Due to development time constraints, no specific
optimizations could be applied to reduce the area of the design; given the
large number of blocks and the redundancy features (e.g. triplication of some
components, BIST, datapath encoding) built into the NoC, we regard the
resource utilization figures as very positive. Note that triplicated logic in the
switch has to be marked with special annotations embedded in the RTL,
otherwise the Xilinx synthesis tools would detect it as redundant and prune
it away. The design is not aimed at, and not optimized for, high performance.
The very high resource utilization also imposes a significant timing
overhead, as routing necessarily becomes more convoluted and less efficient.
We record a maximum operating frequency of 38 MHz; the critical path is
in the BIST logic of the switch.
10 Validating Built-in Self-Testing and NoC
conﬁguration
In order to validate the Built-in Self-Testing implemented in the 4x4 mesh,
a permanent failure was forced in the network by hard-wiring to zero a link
wire. In this implementation, the failure was injected in the link between
switch 11 and 10. However, it could have been freely injected in diﬀerent
locations since the 4x4 mesh has been based on a switch that guarantees
around 97% of stuck-at-fault coverage.
As soon as the FPGA board is booted the BIST automatically starts and
the switches cooperatively exchange test patterns as shown in Figure 9.5.
When the BIST procedure is completed, the BIST managers integrated in
each switch send to the dual NoC the diagnosis information related to the
switch they belong to. In the FPGA platform under test, the BIST manager
of Switch 10 reveals an error on its East input channel where the error has
been injected. Thus it notifies the dual NoC, which takes care of delivering all
the BIST diagnosis information to the global manager (i.e., the supervision
MicroBlaze). In particular, the diagnosis information crosses the dual NoC
and the Dual NoC Receiver before being stored in the memory connected
to the supervision subsystem (Figure 9.6). The supervision MicroBlaze has
been programmed to periodically check for dual NoC notiﬁcations by polling
the control bus memory (Figure 9.7). It recognizes when the BIST notiﬁca-
tion information has been stored in the memory (i.e., the BIST procedure is
completed) and it runs the conﬁguration algorithm. Thus, it computes con-
ﬁguration bits able to guarantee deadlock-free routes despite the failed link.
The conﬁguration bits are sent to the Dual NoC Driver through the AXI
bus. They cross the dual NoC and conﬁgure the routing mechanism of each
switch (Figure 9.8).
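The boot-time flow above (diagnosis reports written into the control bus memory, periodic polling by the supervisor, configuration once BIST has completed) can be summarised with a small behavioural model. The notification format, the rule that BIST is complete when every switch has reported, and all class and method names are assumptions for illustration; the actual message encoding is not given in the text.

```python
from collections import deque

NUM_SWITCHES = 16


class ControlBusMemory:
    """Stand-in for the on-bus memory written by the Dual NoC Receiver."""

    def __init__(self):
        self._pending = deque()

    def write_report(self, switch, faulty_ports):
        self._pending.append((switch, tuple(faulty_ports)))

    def drain(self):
        """Return all pending reports and clear them."""
        reports = list(self._pending)
        self._pending.clear()
        return reports


class Supervisor:
    """Accumulates BIST reports across polling passes."""

    def __init__(self, num_switches):
        self.num_switches = num_switches
        self.reports = {}

    def poll(self, memory):
        """One polling pass; returns the faulty channels once BIST is done."""
        for switch, faulty_ports in memory.drain():
            self.reports[switch] = faulty_ports
        if len(self.reports) < self.num_switches:
            return None                      # some switches have not reported
        return sorted((s, p) for s, ports in self.reports.items() for p in ports)


memory = ControlBusMemory()
supervisor = Supervisor(NUM_SWITCHES)

# Fifteen switches report no fault; the supervisor keeps waiting.
for s in range(NUM_SWITCHES):
    if s != 10:
        memory.write_report(s, [])
early = supervisor.poll(memory)

# Switch 10 reports an error on its East input channel, as in the experiment.
memory.write_report(10, ["E"])
faults = supervisor.poll(memory)
print(early, faults)
```

In the platform, the list of faulty channels obtained at this point is what the configuration algorithm takes as input before the configuration bits are sent back through the Dual NoC Driver.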
11 Validating Fault Detection and NoC Reconfiguration
Once the network has been tested and permanent faults have been detected
and handled by the off-line configuration, the system can still be affected by
run-time transient and intermittent faults. Such faults cannot be handled by
off-line strategies as they appear and disappear unpredictably. As a result,
the network has been designed to be fault tolerant to satisfy the high reliability
constraints imposed by modern systems. In particular, the fault-tolerant
flow control protocol (NACK/GO) is used on the data path to notify error
detection and trigger data retransmissions. Although this protocol has been
primarily designed to tackle SEUs (Single Event Upsets), the system is also
able to tackle physical effects such as wear-out. Indeed, wear-out effects end
up in permanent faults but they are known to have a gradual onset. In prac-
tice, frequent transient faults aﬀecting the same circuitry denote the possible
onset of a permanent fault. Before this happens, the network routing function
could be modiﬁed to exclude the aﬀected circuit from communication traﬃc.
NACK/GO lends itself to such a policy, since its retransmission and/or voting
events can be reported to the supervision MicroBlaze, which can monitor
the distribution and frequency of transient faults over time and eventually
take the appropriate recovery action. This exact policy is supported and
validated by the FPGA platform. Physical buttons and switches of the board
are connected to an on-chip GPIO controller to allow the user to interact with
the platform. The physical buttons have been leveraged to inject transient
faults in the network and validate the above-mentioned fault-tolerance policy.
For this purpose, a fault injection module has been instantiated along the
link routed from switch 4 to 5. This module is connected to a physical button
on the FPGA board and provides a method to inject transient faults on that
link (see Figure 9.9). Since the link may be idle when the button is pressed,
the fault injection module integrates a simple FSM that waits until a valid
flit is crossing the link before injecting the fault, ensuring that meaningful
data is actually corrupted. Therefore, every time the button is pushed, the
error is revealed and notiﬁed to the supervision MicroBlaze (Figure 9.10).
As with the BIST notification, the transient
notification crosses the dual NoC and is stored in the control bus memory.
The supervision MicroBlaze keeps polling this memory periodically during
run-time operation. It thus reads the transient notification and updates its
register of the distribution and frequency of transient faults over time. Only
when the number of transient notiﬁcations from the same link reaches a
threshold is recovery action taken. For the sake of the demonstration, the
MicroBlaze’s software is set to run its reconﬁguration procedure after three
notiﬁcations (i.e., after the button has been pushed three times). Note that
the 4x4 mesh at this stage is irregular since a link has already been disabled
due to a previously detected permanent failure. Thus, the algorithm computes
the reconfiguration bits for this irregular network and delivers them
to the dual NoC (Figure 9.11). The new reconfiguration bits coming from
the dual NoC cannot directly overwrite the existing routing configuration, as is
done during off-line operation, since applications are now running. Thus, the network
implements the OSR-Lite reconﬁguration mechanism which avoids stopping
or draining network traﬃc during the transition from one network conﬁgu-
ration to another. The switches of the FPGA platform start to inject tokens
into the network. The tokens follow the channel dependency graph of the
old routing function and progressively drain the network of old packets,
as represented in Figure 9.12.
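The threshold policy described above (count transient notifications per link, and trigger reconfiguration only once a link reaches the threshold, three in the demonstration) can be sketched as follows. This is a minimal model under stated assumptions, not the supervision software itself: the class name `TransientFaultMonitor`, the `(source, destination)` tuple used to identify a link, and the rule of ignoring links that were already excluded are illustrative choices.

```python
from collections import Counter


class TransientFaultMonitor:
    """Per-link transient fault counter with a reconfiguration threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # three notifications in the demonstration
        self.counts = Counter()     # notifications seen so far, per link
        self.disabled = set()       # links already excluded from routing

    def notify(self, link):
        """Record one notification; return True if reconfiguration is due."""
        if link in self.disabled:
            return False  # link is no longer used, nothing to do
        self.counts[link] += 1
        if self.counts[link] >= self.threshold:
            self.disabled.add(link)
            return True   # caller should now run the reconfiguration procedure
        return False
```

With the link from switch 4 to switch 5 identified as `(4, 5)`, three calls to `notify((4, 5))` return `False`, `False`, `True`, which corresponds to pushing the button three times.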
Resource Utilization    Supervisor Subsystem only    Full Platform
Slice Registers         5%                           22%
Slice LUTs              16%                          88%
IOs                     20%                          20%
36-bit BRAMs            61%                          61%
Table 9.1: Resource utilization of the Virtex 7 chip.
12 Conclusion
This chapter reports on the prototyping of a 16-core homogeneous multi-
core processor with a fault-tolerant, runtime reconﬁgurable and dynamically
virtualizable on-chip network. The prototyped system has been successfully
validated in its capability of boot-time testing and conﬁguration, transient or
intermittent fault detection, runtime reconﬁguration of the routing function,
and dynamic partitioning and isolation. The validated NoC prototype is a
key enabler for the future evolution of embedded systems. First, it enables
the integration of multiple software functions on a single multi- and many-
core processor (multifunction integration). This is the most eﬃcient way of
utilizing the available computing power. Finally, a comprehensive reliability
framework has been put in place, from switch level to network level, while
covering all design aspects (e.g., reliable control signaling) and effectively
co-optimizing different architectural features together (fault tolerance, testing,
runtime reconfiguration, control signaling) 1.
1 This chapter includes content from a cooperative and interdisciplinary work; further details are in [74].
Figure 9.3: Basic components of the on-chip network.
Figure 9.4: Design flow for platform implementation.
Figure 9.5: Built-In-Self-Testing at work (a).
Figure 9.6: Built-In-Self-Testing at work (b).
Figure 9.7: Built-In-Self-Testing at work (c).
Figure 9.8: Built-In-Self-Testing at work (d).
Figure 9.9: Transient fault detection and reconfiguration (a).
Figure 9.10: Transient fault detection and reconfiguration (b).
Figure 9.11: Transient fault detection and reconfiguration (c).
Figure 9.12: Transient fault detection and reconfiguration (d).
Chapter 10
Power Characterization of
Optical NoC Interfaces
1 Introduction
The objective of this chapter is to characterize the power consumption of
an optical network interface with respect to its electronic counterpart. Every
electronic component has been synthesized, placed and routed using a low-power
40nm industrial technology library, in order to provide realistic power
measurements (not derived from optimistic or ideal estimations). Power metrics
have been calculated by back-annotating the switching activity of block-internal
nets, and then importing the waveforms into the PrimeTime tool. It is worth
observing that we have applied clock gating for the sake of a realistic
measurement of static power. Energy per bit has been computed by subtracting the
static power from the total power on a per-component basis, under a 50% switching
activity assumption.
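As a worked illustration of the energy-per-bit figure, the sketch below subtracts static power from total power and normalizes by a bit rate. Only the subtraction is stated in the text; the normalization by the bit rate, the function name and the numbers in the usage note are assumptions made for this example.

```python
def dynamic_energy_per_bit_fj(total_power_mw, static_power_mw, bit_rate_gbps):
    """Dynamic energy per bit in fJ/bit for one component.

    Assumption: the dynamic power (total minus static) is divided by the
    bit rate at which the component operates.
    """
    dynamic_mw = total_power_mw - static_power_mw
    if dynamic_mw < 0:
        raise ValueError("static power cannot exceed total power")
    if bit_rate_gbps <= 0:
        raise ValueError("bit rate must be positive")
    # 1 mW / (1 Gb/s) = 1 pJ/bit = 1000 fJ/bit
    return dynamic_mw / bit_rate_gbps * 1000.0
```

For instance, a hypothetical block drawing 1.5 mW in total, of which 0.5 mW is static, at 10 Gb/s would come out at about 100 fJ/bit.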
2 Optical Network Interface Architecture
This section describes, to the best of our knowledge, the first complete network
interface architecture for optical networks, as depicted in Figure 10.1.
As a consequence, the objective is not to present the best possible design
point, but rather to start considering the basic components and to indicate
which ones deserve the most intensive optimization effort before optical
interconnect technology is ready for prime time. To avoid message-dependent deadlock, every
network interface needs separate buﬀering resources for each one of the three
message classes of the MOESI protocol. This should be combined with the
requirements of wavelength routing: each initiator needs an output for each
possible target, and each target needs an input for each possible source. As
a result, in the baseline version of the NI, each initiator comes with 3 FIFOs
for each potential target, and each target with 3 FIFOs for each potential ini-
tiator. In a more energy-eﬃcient version of the NI (the one in Figure 10.1),
all destinations share the same 3 FIFOs and the ﬂits are sent to diﬀerent
paths afterwards (all logic components after 1x15 demuxes are replicated for
each destination). All the FIFOs at both the transmission and the reception
side must be dual-clock FIFOs to move data between the processor frequency
domain (1.2 GHz) and the one used inside the NI. The serializers are responsible
for translating the flit into a 10 GHz bit stream. The reception side
mirrors the transmission side: flits go through deserialization and then through
another set of dual-clock FIFOs. Clearly, ONoCs move most of their complexity to the NIs,
which should therefore not be overlooked by means of overly abstract mod-
els. Another key issue to be considered in NIs concerns the resynchronization
of received optical pulses with the clock signal of the electronic receiver. In
this chapter we assume source-synchronous communication, which implies
that each point-to-point communication requires a strobe signal to be trans-
mitted along with the data on a separate wavelength, and used to correctly
sample received data. Optical transmission of clock signals is an active re-
search ﬁeld: see for instance [114]. This strobe signal is generated starting
from the electrical clock of the transmitter, and removes the need for phase-
locked loops (PLLs) or delay-locked loops (DLLs). In this work, we assume
that a form of clock gating is implemented: when no data is transmitted,
the clock signal is gated. Another typically overlooked issue is
the backpressure mechanism. We opt for credit-based flow control because
it does not rely on timing assumptions, and credit tokens can reuse the existing
communication paths, thus avoiding any additional waveguide and resulting in a
milder impact over static power.
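The credit-based backpressure chosen above can be illustrated with a small behavioral model. It is a generic sketch of credit-based flow control, not the NI's RTL: the class names, the buffer depth and the way a credit is returned (a direct method call instead of a token travelling back over the communication path) are assumptions.

```python
from collections import deque


class CreditReceiver:
    """Receiver-side buffer; frees one slot (one credit) per consumed flit."""

    def __init__(self, buffer_slots):
        self.slots = buffer_slots
        self.fifo = deque()

    def accept(self, flit):
        if len(self.fifo) >= self.slots:
            raise OverflowError("sender violated the credit protocol")
        self.fifo.append(flit)

    def consume(self, sender):
        flit = self.fifo.popleft()
        sender.credit_returned()  # models the credit token sent back
        return flit


class CreditSender:
    """Sender that may transmit only while it holds at least one credit."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots  # one credit per free receiver slot

    def send(self, flit, receiver):
        if self.credits == 0:
            return False             # backpressure: flit stays at the sender
        self.credits -= 1
        receiver.accept(flit)
        return True

    def credit_returned(self):
        self.credits += 1
```

Because the sender never transmits without a credit, the receiver buffer cannot overflow regardless of link latency, which is why the scheme needs no timing assumptions.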
Figure 10.1: Optical Network Interface Architecture.
3 Power Characterization
Electronic NI buffering and frequency converters (dual-clock FIFOs) contribute
around 11.5 mW (see Figures 10.2 and 10.3). The static (idle) power dissipated
by the entire network (16 switches) is around 286 mW
(see Figures 10.4 and 10.5; only the top-level clock tree is omitted). Similarly, the
power dissipation of Optical Network Interfaces is computed by composing
the power consumption of each of its sub-blocks (DC FIFOs at the transmis-
sion sides, Demultiplexers, SERs, Synchronizers, DESERs, DC FIFOs at the
reception sides, Multiplexers, and Credit counters).
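The composition of sub-block contributions can be sketched as a weighted sum of per-block static power. The example below only covers the transmission-side FIFOs and contrasts the baseline NI (3 FIFOs per target) with the shared-FIFO variant; the 15 destinations are inferred from the 1x15 demuxes, the 0.12 mW figure is the dual-clock FIFO static power listed in Table 10.1, and the function names are hypothetical.

```python
MESSAGE_CLASSES = 3        # MOESI message classes, from the text
NUM_DESTINATIONS = 15      # assumption: implied by the 1x15 demuxes
DC_FIFO_STATIC_MW = 0.12   # static power of one dual-clock FIFO (Table 10.1)


def tx_fifo_count(shared):
    """Number of transmission-side FIFOs in one initiator NI."""
    if shared:
        return MESSAGE_CLASSES                     # all destinations share 3 FIFOs
    return MESSAGE_CLASSES * NUM_DESTINATIONS      # baseline: 3 FIFOs per target


def compose_static_power_mw(block_counts, block_static_mw):
    """Sum static power over sub-blocks: count times per-instance power."""
    return sum(n * block_static_mw[name] for name, n in block_counts.items())
```

Under these assumptions the baseline needs 45 transmission FIFOs (about 5.4 mW of static power) against 3 (about 0.36 mW) for the shared variant; the remaining sub-blocks would be added to `block_counts` in the same way.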
The static power contribution of all optical components is given by: laser
sources, thermal tuning, transmitter (i.e., the driver-ring modulator pair),
receiver (i.e., photodetector, trans-impedance amplifier, and comparator)
and the source-synchronous clock. The latter term is in turn composed
of further laser sources, transmitters, receivers, and MRRs.
For these parameters we assume values derived from the literature [113], [112].
These resources must be replicated as many times as the target bit parallelism,
and also for the optical clock support. Power metrics of all basic blocks
of our architectures are summarized in Table 10.1. The derived static power
values for electronic and optical components are illustrated in ﬁgures 10.4
and 10.5.
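The replication rule stated above (optical resources are instantiated once per parallel bit, plus once more for the optical clock) can be written as a short formula. This is a sketch under simplifying assumptions: each wavelength is taken to need one transmitter, one receiver and one thermal-tuning contribution, a single clock wavelength is assumed, and the laser power is left as a parameter because Table 10.1 does not list it.

```python
def optical_static_power_mw(bit_parallelism, tx_mw, rx_mw, thermal_mw,
                            laser_mw=0.0, clock_wavelengths=1):
    """Static power of one point-to-point optical link, in mW.

    Assumptions: identical cost per wavelength, and `clock_wavelengths`
    extra wavelengths carrying the source-synchronous strobe.
    """
    if bit_parallelism < 1:
        raise ValueError("bit parallelism must be at least 1")
    wavelengths = bit_parallelism + clock_wavelengths
    per_wavelength = tx_mw + rx_mw + thermal_mw + laser_mw
    return wavelengths * per_wavelength
```

Plugging in the aggressive transmitter and receiver values and the thermal tuning value of Table 10.1 (0.025, 0.050 and 0.020 mW) with a bit parallelism of 4 gives 5 wavelengths at 0.095 mW each, excluding laser power.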
Electronic Devices        Static Power (mW)    Dynamic Energy (fJ/bit)
DC FIFO TX 5 //3          0.12                 10.65
DC FIFO RX 5 //3          0.12                 8.54
DC FIFO TX 22 //3         0.12                 39
DC FIFO RX 15 //3         0.12                 26.50
MUX4x1 ARB //3            0.08                 0.36
MUX45x1 ARB //3           0.9                  5.09
SERIALIZER //3            0.0475               9.41
DESERIALIZER //3          0.0289               7.74
MESO SYNCH //3            0.082                8
BRUTE FORCE //3           0.004234             1.4
DC FIFO TX 5 //4          0.12                 12.72
DC FIFO RX 5 //4          0.12                 10.2
DC FIFO TX 22 //4         0.12                 46.41
DC FIFO RX 15 //4         0.12                 31.65
MUX4x1 ARB //4            0.11                 0.49
MUX45x1 ARB //4           0.9                  5.09
SERIALIZER //4            0.0417               2.63
DESERIALIZER //4          0.0281               6.12
MESO SYNCH //4            0.113                11.1
BRUTE FORCE //4           0.00503              1.66
DEMUX1x3 //4              0.000725             0.92
DEMUX1x15 //4             0.0021               25.21
DEMUX1x4 //4              0.00056              6.72
COUNTER@4bits //4         0.02964              1.014
TSV                       /                    2.5
TRANSMITTER (aggr)        0.025                20
TRANSMITTER (real)        0.100                50
RECEIVER (aggr)           0.050                10
RECEIVER (real)           0.150                25
THERMAL. T @20K           0.020                /
E-SWITCH (3VC)            5.844                193
Table 10.1: Static and Dynamic Power Of Electronic Devices.
4 Analysis and Discussion
Network interfaces are typically oversimpliﬁed, and end up being abstracted
by simple input/output FIFOs of inﬁnite length. Similarly, the blocking eﬀect
of the backpressure mechanism is overlooked. As a consequence, the ONoC
easily proves much more performance-eﬃcient than the electronic counter-
part. Moreover, the lack of a layout analysis in addition to a physical-layer
analysis in ONoC design is another important source of optimism in previous
evaluations. In contrast, the key strength of this research (AMF methodol-
ogy) consists of a careful exploration of E/O and O/E interfaces, accounting
for the contributions and eﬀects of every building block: routing, buﬀering,
serialization and deserialization processes, as well as optical transmitters and
receivers, clock domain synchronizer, backpressure cost.
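To make the per-block accounting concrete, the sketch below adds up the dynamic energy of the blocks a bit might traverse from initiator to target and ranks their shares. The energy values are the //4 and aggressive entries of Table 10.1, but the selection and order of blocks on the path is an assumption made for illustration, not the evaluation flow of this work.

```python
# (block name, dynamic energy in fJ/bit from Table 10.1); the choice of
# blocks on the path is hypothetical.
ASSUMED_PATH_FJ_PER_BIT = [
    ("DC FIFO TX 5", 12.72),
    ("DEMUX1x15", 25.21),
    ("SERIALIZER", 2.63),
    ("TRANSMITTER (aggr)", 20.0),
    ("RECEIVER (aggr)", 10.0),
    ("DESERIALIZER", 6.12),
    ("MESO SYNCH", 11.1),
    ("DC FIFO RX 5", 10.2),
    ("MUX45x1 ARB", 5.09),
]


def energy_breakdown(path):
    """Return (total fJ/bit, blocks with their share of the total, largest first)."""
    total = sum(energy for _, energy in path)
    shares = sorted(((name, energy / total) for name, energy in path),
                    key=lambda item: item[1], reverse=True)
    return total, shares
```

A breakdown of this kind shows which interface blocks dominate the per-bit cost, which is the information that infinite-FIFO abstractions of the NI hide.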
5 Conclusion
This chapter aims at a high level of practical relevance in the power charac-
terization of an optical NoC vs. its electronic counterpart. The key novelty
consists of the use of an electronic baseline aggressively optimized for low-
power. With conservative projections for optical component parameters, the
major role played by static power is apparent. This calls for new power gat-
ing techniques. With more aggressive projections, the network interface turns
out to be the clear bottleneck to achieve the break-even point with low-power
ENoCs, hence it should be thoroughly analyzed for optimization. In future
work, we will investigate more communication-dominated scenarios, in an at-
tempt to capitalize on the far lower dynamic power consumption of ONoCs.
1
1 This chapter includes content from a cooperative and interdisciplinary work; further details are in [115].
Figure 10.2: Static Power of Electronic Network Interface vs. Optical Network Interface @//3.
Figure 10.3: Static Power of Electronic Network Interface vs. Optical Network Interface @//4.
Figure 10.4: Total Static Power of Electronic Network vs. Optical Network @//3.
Figure 10.5: Total Static Power of Electronic Network vs. Optical Network @//4.

Conclusions
This study focuses on the next generation of homogeneous many-core
systems. In particular, it deals with the cross-layer design, optimization and
prototyping of on-chip interconnection networks. The main goals of this thesis were
to design, optimize, test and prototype an on-chip interconnection network.
To achieve these objectives, we first analyzed two architectural variants of
a mesh and carried out an architectural study of three types of on-chip
interconnection to better understand the various trade-offs (power consumption,
area and performance) between synchronous, mesochronous and multi-synchronous
NoCs. Subsequently, we designed various testing infrastructures
and assessed their coverage, area overhead and number of testing cycles. Moreover,
we designed an ultra-low latency network-on-chip testing infrastructure,
suitable for online testing. In addition, we designed a new congestion avoidance
mechanism named HACS to reduce congestion in the network, with interesting
results. Most important of all was the validation of some of the
ideas of this thesis on an FPGA prototype. Finally, our goals were achieved
and we started to pave the way for emerging technologies such as optical
interconnect technology by providing a key enabler for the characterization of
the power consumption of optical network interfaces. Overall, the thesis is a
comprehensive contribution to the advancement of the field of many-core NoC-based
system design.

Bibliography
[1] Simone Terenzi, Alessandro Strano, Davide Bertozzi.
”Optimizing Built-In Pseudo-Random Self-Testing for Network-on-Chip
Switches” - INA-OCMC 2012
[2] S.Y.Lin, C.C.Hsu, A.Y.Wu.
”A Scalable Built-In Self-Test/Self-Diagnosis Architecture for 2D-mesh
Based Chip Multiprocessor Systems”
IEEE Int. Symp. on Circuits and Systems - 2009
[3] A. Strano, C. Gómez, D. Ludovici, M. Favalli, M.E. Gómez, D. Bertozzi.
”Exploiting Network-on-Chip Structural Redundancy for A Cooperative
and Scalable Built-In Self-Test Architecture” - DATE -2011
[4] Markus, A.; Raik, J.; Ubar, R.
”Fast and Eﬃcient Static Compaction of Test Sequences Using Bipartite
Graph Representation”
Proc. of the Second Electronic Circuits and Systems Conference (ECS’99)
[5] Sheng Zhang, Sharad C. Seth, Bhargab B. Bhattacharya.
”Eﬃcient Test Compaction for Pseudo-Random Testing”
Proc. of the 14th Asian Test Symposium (ATS ’05)
[6] S.Stergiou et al.
”Xpipes Lite: a Synthesis Oriented Design Library for Networks on
Chips” - DAC - 2005
[7] D.Wentzlaﬀ et al.
”On-Chip Interconnection Architecture of the Tile Processor”
-IEEE Micro 2005
[8] J.Raik, V.Govind, R.Ubar.
”An External Test Approach for Network-on-a-Chip Switches”
Proc. of the IEEE Asian Test Symposium - 2006
[9] J.Raik, V.Govind, R.Ubar.
”Test Conﬁgurations for Diagnosing Faulty Links in NoC Switches”
Proc. ETS - 2007
[10] D. A. Ilitzky, J. D. Hoffman, A. Chun and B. P. Esparza
”Architecture of the Scalable Communications Core’s Network on Chip”
IEEE MICRO - 2007
[11] J.Raik, V.Govind, R.Ubar
”DfT-based External Test and Diagnosis of Mesh-like NoCs”
IET Computers and Digital Techniques - 2009
[12] V.Bertacco, D.Fick, A.DeOrio, J.Hu, D.Blaauw, D.Sylvester
”VICIS: A Reliable Network for Unreliable Silicon”
DAC - 2009
[13] K.Peterson, J.Oberg
”Toward a Scalable Test Methodology for 2D-mesh Network-on-Chip”
DATE - 2007
[14] A.M. Amory, E.Briao, E.Cota, M.Lubaszewski, F.G.Moraes
”A Scalable Test Strategy for Network-on-Chip Routers”
Proc. of ITC-2005
[15] K.Arabi
”Logic BIST and Scan Test Techniques for Multiple Identical Blocks”
IEEE VLSI Test Symposium 2002
[16] C.Grecu, P.Pande, B.Wang, A.Ivanov, R.Saleh.
”Logic BIST and Scan Test Techniques for Multiple Identical Blocks”
IEEE DFT - 2005
[17] R.Ubar, J.Raik
”Testing Strategies for Network on Chip” - book edited by A.Jantsch
and H.Tenhunen, Kluwer Academic Publisher, IEEE DFT - 2003
[18] C.Aktouf
”Testing Strategies for Network on Chip”
IEEE Design and Test of Computers - 2002
[19] Y.Lin, C.C.Hsu, A.Y.Wu ”A Scalable Built-In Self-Test/Self-Diagnosis
Architecture for 2D-mesh Based Chip Multiprocessor Systems”
IEEE Int. Symp. on Circuits and Systems - 2009
[20] M.Hosseinabady, A.Banaiyan, M.N.Bojnordi, Z.Navabi
”A Concurrent Testing Method for NoC Switches”
DATE - 2006
[21] C.Grecu, P.Pande, A.Ivanov, R.Saleh
”BIST for Network-on-Chip Interconnect Infrastructures”
VLSI Test Symposium-2006
[22] S.Rodrigo, J.Flich, A.Roca, S.Medardoni, D.Bertozzi, J.Camacho,
F.Silla, J.Duato
”Addressing Manufacturing Challenges with Cost-Eﬀective Fault Tolerant
Routing”
NOCS-2010
[23] Antti Markus, Jaan Raik, Raimund Ubar
”Fast and Eﬃcient Static Compaction of Test Sequences Using Bipartite
Graph Representation”
Proc. of the Second Electronic Circuits and Systems Conference,(ECS)-
1999
[24] F.Clermidy, R.Lemaire, X.Popon, D.Ktenas, Y.Thonnart
”An Open and Reconfigurable Platform for 4G Telecommunication: Concepts and Application”
Euromicro Conference on Digital System Design-2009
[25] F.Clermidy, C.Bernard, R.Lemaire, J.Martin, I.Miro-Panades,
Y.Thonnart, P.Vivet, N.Wehn
”A 477mW NoC-based Digital Baseband for MIMO 4G SDR”
ISSCC-2010
[26] Y.Thonnart, P.Vivet, F.Clermidy
”A Fully-Asynchronous Low-Power Framework for GALS NoC Integra-
tion”
DATE-2010
[27] R.Dobkin, V.Vishnyakov, E.Friedman, R.Ginosar
”An Asynchronous Router for Multiple Service Levels Networks on Chip”
Proc. of ASYNC -2005
[28] T.Bjerregaard, J.Sparso
”A Router Architecture for Connection-Oriented Service Guarantees in
the MANGO Clockless Network-on-Chip”
DATE-2005
[29] W.Kim, M.S.Gupta, G.Y.Wei and D.Brooks
”System Level Analysis of Fast, Per-Core DVFS using On-Chip Switching
Regulators”
Int. Symp. on High-Performance Computer Architecture-2008
[30] W.Kim, M.S.Gupta, G.Y.Wei and D.Brooks
”System Level Analysis of Fast, Per-Core DVFS using On-Chip Switching
Regulators”
Int. Symp. on High-Performance Computer Architecture-2008
[31] C.Cummings, P.Alfke
”Simulation and Synthesis Techniques for Asynchronous FIFO Design
with Asynchronous Pointer Comparison”
SNUG-2002, San Jose-2002
[32] I.M.Panades, A.Greiner
”Bi-Synchronous FIFO for Synchronous Circuit Communication Well
Suited for Network-on-Chip in GALS Architectures”
Int. Symp. on Networks-on-Chip-2007
[33] T.Ono, M.Greenstreet
”A Modular Synchronizing FIFO for NOCs”
International Network-on-Chip Symposium-2009
[34] D.Verbitsky, R.R.Dobkin, R.Ginosar
”A Four-Stage Mesochronous Synchronizer with Back-Pressure and
Buﬀering for Short and Long Range Communications”
International Network-on-Chip Symposium
http://webee.technion.ac.il/ ran/papers
[35] M.Alshaikh,D.Kinniment, A.Yakovlev
”Robust Synchronization using the Wagging Technique”
Technical Report. TR NCL EECE-MSD-TR
[36] F.Mu, C.Svensson
”Self-Tested Self-Synchronization Circuit for Mesochronous Clocking”
IEEE Trans. on Circuits and Systems II: Analog and Digital Signal
Processing-2001
[37] A.Sheibanyrad, I.M.Panades, A.Greiner
”Multisynchronous and Fully Asynchronous NoCs for GALS Architec-
tures”
IEEE Design and Test of Computers-2008
[38] M.N.Horak, S.M.Nowick, M.Carlberg, U.Vishkin
”A Low-Overhead Asynchronous Interconnection Network for GALS Chip
Multiprocessors”
ACM/IEEE Int. Symp. on Networks-on-Chip-2010
[39] J.Bainbridge, S.Furber
”CHAIN: a Delay-Insensitive Chip Area Interconnect”
IEEE Micro-2002
[40] L.A.Plana, S.B.Furber, S.Temple, M.Khan, Y.Shi, J.Wu, S.Yang
”A GALS Infrastructure for a Massively Parallel Multiprocessor”
IEEE Design and Test of Computers-2007
[41] E.Beigne, F.Clermidy, P.Vivet, A.Clouard, M.Renaudin
”An Asynchronous NoC Architecture Providing Low Latency Service and
Its Multilevel Design Framework”
IEEE Asynch. Symp.-2005
[42] D. Lattard, E. Beigne, F. Clermidy, Y. Durand, R. Lemaire, P. Vivet,
F. Berens
”A Reconﬁgurable Baseband Platform Based on an Asynchronous
Network-on-Chip”
IEEE Journal Of Solid State Circuits-2008
[43] B.Quinton, M.Greenstreet, S.Wilton
”Practical Asynchronous Interconnect Network Design”
IEEE Trans. on VLSI-2008
[44] S.Hollis, S.W.Moore
”Rasp: an Area-Eﬃcient, on-Chip Network”
IEEE Int. Conf. on Computer Design-2006
[45] W.J. Dally and J.W. Poulton
”Digital Systems Engineering”
Cambridge University Press-1998
[46] F.Vitullo, N.E.L’Insalata et al.
”Low-Complexity Link Microarchitecture for Mesochronous Communica-
tion in Networks-on-Chip”
IEEE Trans. on Computers-2008
[47] I.M.Panades, F.Clermidy, P.Vivet, A.Greiner
”Physical Implementation of the DSPIN Network-on-Chip in the FAUST
Architecture”
ACM/IEEE Int. Symp. on Networks-on-Chip-2008
[48] D.Wentzlaﬀ et al.
book chapter from ”Designing Network On-Chip Architectures in the
Nanoscale Era, edited by J.Flich and D.Bertozzi, CRC Press”
Networks of the Tilera Multicore Processor-2011
[49] S.Murali et al.
”NoC Synthesis Flow for Customized Domain Speciﬁc Multiprocessor
Systems-on-Chip”
IEEE Trans. on Parallel and Distributed Systems-2005
[50] D. Ludovici, A. Strano, G. N. Gaydadjiev and D. Bertozzi
”Mesochronous NoC Technology for Power-Eﬃcient GALS MPSoCs”
INAOCMC-2011
[51] S.Beer, R.Ginosar, M.Priel, R.R.Dobkin, A.Kolodny
”The Devolution of Synchronizers”
ASYNCH-2010
[52] A.Strano, D.Ludovici, D.Bertozzi
”A Library of Dual-Clock FIFOs for Cost-Eﬀective and Flexible MP-
SoCs”
Proc. of SAMOS-2010
[53] D. Ludovici and A. Strano and G. N. Gaydadjiev and L. Benini and D.
Bertozzi
”Design Space Exploration of a Mesochronous Link for Cost-Eﬀective and
Flexible GALS NOCs”
Proceedings of Design, Automation and Test in Europe-2010
[54] D. Ludovici, A. Strano, D. Bertozzi
”Architecture design principles for the integration of synchronization in-
terfaces into Network-on-Chip switches”
NoCArc: Proceedings of the 2nd International Workshop on Network on
Chip Architectures-2009
[55] Kakoee, M.R. and Loi, I. and Benini, L.
”A New Physical Routing Approach for Robust Bundled Signaling on NoC
Links”
Proceedings of the 20th Great Lakes Symposium on VLSI-2010
[56] M. Krstic, X. Fan, C. Wolf, A. Strano, D. Bertozzi
”A New Physical Routing Approach for Robust Bundled Signaling on NoC
Links”
Deliverable D29 - Test and Measurement Report of Moonrake Chip,
Galaxy Project-2010
www.galaxy-project.org
[57] Abts, D., Enright Jerger, N.D., Kim, J., Gibson, D., Lipasti, M.H.
”Achieving predictable performance through better memory controller
placement in many-core CMPs”
ACM SIGARCH Computer Architecture News 37(3), 451 (Jun 2009)
http://portal.acm.org/citation.cfmdoid=1555815.1555810
[58] Das, R., Mutlu, O., Kumar, A., Azimi, M.
”Application-to-core mapping policies to reduce interference in on-chip
networks”
Tech. rep., SAFARI Technical Report No. 2011 (2011)
http://www.ece.cmu.edu/omutlu/pub/interference-aware-noc-mapping-
TR-SAFARI-2011-001.pdf
[59] Trivino, F., Sanchez, J.L., Alfaro, F.J., Flich, J.
” Virtualizing network-on-chip resources in chip-multiprocessors”
Microprocessors and Microsystems 35(2), 230-245 (Mar 2011)
http://linkinghub.elsevier.com/retrieve/pii/S0141933110000712
[60] Gratz, P., Grot, B., Keckler, S.W.
”Regional congestion awareness for load balance in networks-on-chip”
HPCA. pp. 203-214. IEEE Computer Society (2008)
[61] Li, M., Zeng, Q.A., Jone, W.B.
”DyXY: a proximity congestion-aware deadlock-free dynamic routing
method for network on chip.”
Proceedings of the 43rd annual Design Automation Conference. pp.
849-852. DAC 06, ACM, New York, 2006
http://doi.acm.org/10.1145/1146909.1147125
[62] Marescaux, T., Rangevall, A., Nollet, V., Bartic, A., Corporaal, H.
”Distributed congestion control for packet switched networks on chip”
Proceedings of the International Conference of Parallel Computing:
Current Future Issues of High-End Computing. vol. 33, pp. 761-768, 2005
http://citeseerx.ist.psu.edu
[63] Wu, D., Al-Hashimi, B.M., Schmitz, M.T.
”Improving routing eﬃciency for network-on-chip through contention-
aware input selection”
Proceedings of the 2006 Asia and South Paciﬁc Design Automation
Conference. pp. 36-41.
ASP- DAC 06, IEEE Press, Piscataway, NJ, USA (2006),
http://dx.doi.org
[64] Sanchez, D., Michelogiannakis, G., Kozyrakis, C.
”An analysis of on-chip interconnection networks for large-scale chip mul-
tiprocessors”
ACM Transactions on Architecture and Code Optimization (TACO) 7(1),
4 (2010)
http://portal.acm.org/citation.cfmid=1736069
[65] Das, R., Mutlu, O., Moscibroda, T., Das, C.R.
”Application-aware prioritization mechanisms for on-chip networks”
Proceedings of the 42nd Annual IEEE/ACM International Symposium on
Microarchitecture - Micro-42 p. 280 (2009)
http://portal.acm.org/citation.cfmdoid=1669112.1669150
[66] Grot, B., Keckler, S., Mutlu, O.
”Preemptive virtual clock: a ﬂexible, eﬃcient, and cost-eﬀective QOS
scheme for networks-on-chip”
Proceedings of the 42nd Annual IEEE/ACM International Symposium on
Microarchitecture. pp. 268-279. ACM (2009)
http://portal.acm.org/citation.cfmid=1669149
[67] Iyer, R., Zhao, L., Guo, F., Illikkal, R., Makineni, S., Newell, D., Soli-
hin, Y., Hsu, L., Reinhardt, S.
”QoS policies and architecture for cache/memory in CMP platforms.”
ACM SIGMETRICS Performance Evaluation Review 35(1), 25 (Jun
2007)
http://portal.acm.org/citation.cfmdoid=1269899.1254886
[68] Flich, J., Bertozzi, D.
”Designing Network On-Chip Architectures in the Nanoscale Era.”
Chapman & Hall/CRC (2010)
[69] Chen, G., Li, F., Son, S.W., Kandemir, M.
”Application mapping for chip multiprocessors.” Proceedings of the 45th
annual conference on Design automation - DAC 08 p. 620 (2008)
http://portal.acm.org/citation.cfmdoid=1391469.1391628
[70] Das, R., Mutlu, O., Kumar, A., Azimi, M.
”Application-to-core mapping policies to reduce interference in on-chip
networks”
Tech. rep., SAFARI Technical Report No. 2011 (2011)
http://www.ece.cmu.edu/omutlu/pub/interference-aware-noc-mapping-
TR-SAFARI-2011-001.pdf
[71] Rodrigo, S., Flich, J., Roca, A., Medardoni, S., Bertozzi, D., Camacho,
J., Silla, F., Duato, J.
”Addressing manufacturing challenges with cost-eﬃcient fault tolerant
routing.”
NOCS 10: Proceedings of the 4th ACM/IEEE International Symposium
on Networks-on-Chip. pp. 25-32 (2010)
[72] Gilabert, F., Gomez, M.E., Medardoni, S., Bertozzi, D.
”Improved utilization of noc channel bandwidth by switch replication for
cost-eﬀective multi-processor systems-on-chip”
Proceedings of the 2010 Fourth ACM/IEEE International Symposium
on Networks-on-Chip. pp. 165-172.
IEEE Computer Society, Washington, DC, USA (2010)
http://dx.doi.org/10.1109/NOCS.2010.25
[73] Ubal, R., Sahuquillo, J., Petit, S., Lopez, P.
”Multi2Sim: A Simulation Framework to Evaluate Multicore-
Multithreaded Processors”
Proc. of the 19th Intl Symposium on Computer Architecture and High
Performance Computing (2007)
[74] ”NaNoC: NaNoC design platform.”
http://www.nanoc-project.eu
[75] Roca, S., Flich, J., Silla, F., Duato, J.
”VCTlite: Towards an eﬃcient implementation of virtual cut-through
switching in on-chip networks”
International Conference on High Performance Computing (HiPC). pp.
112 (2010)
[76] Mejia A., Flich, J., Duato, J., Reinemo, S.A., Skeie, T.
”Segment-based routing: An eﬃcient fault-tolerant routing algorithm for
meshes and tori”
International Parallel and Distributed Processing Symposium 0, 84 (2006)
[77] ”Multi2sim Wiki: SPLASH2 execution commands.”
http://www.multi2sim.org/wiki/index.php5/SPLASH2 Execution Com-
mands
[78] Rijpkema, E.; Goossens, K.; Radulescu, A.
”Trade Oﬀs in the Design of a Router with Both Guaranteed and Best-
Eﬀort Services for Networks on Chip.”
DATE03, Mar. 2003, pp. 350-355.
[79] Jose Flich, Davide Bertozzi.
”Designing Network On-Chip Architectures in the Nanoscale Era”
by Chapman and Hall/CRC (2010).
[80] Diego Melpignano, Luca Benini, Eric Flamand, L. Benini, Bruno Jego,
Thierry Lepley, Germain Haugou, Fabien Clermidy , Denis Dutoit.
”Platform 2012, a Many-Core Computing Accelerator for Embedded
SoCs: Performance Evaluation of Visual Analytics Applications”
DAC 2012, June 3-7, 2012, San Francisco, California, USA.
[81] Peter Mandl, Udeepta Bordoloi
”General-purpose Graphics Processing Units Deliver New Capabilities to
the Embedded Market”
http://www.amd.com/tw/Documents/GPGPU-Embedded.pdf
[82] S.Stergiou et al.
”Xpipes Lite: a Synthesis Oriented Design Library for Networks on
Chips”
DAC, pp.559-564, 2005.
[83] A.Strano, F.Trivino, Jose L. Sanchez, Jose Flich, D.Bertozzi
”OSR-Lite: Fast and Deadlock-Free NoC Reconfiguration Framework”
SAMOS 2012.
[84] S.Rodrigo et al.
”Addressing Manufacturing Challenges with Cost-Eﬀective Fault Tolerant
Routing”
NOCS 2010, pp.25-32, 2010.
[85] S.Terenzi, A.Strano, D.Bertozzi
”Optimizing Built-In Pseudo-Random Self-Testing for Network-on-Chip
Switches”
INA-OCMC 2012.
[86] M.Radetzki, C.Feng, X.Zhao, and A.Jantsch.
”Methods for fault tolerance in networks on chip.”
ACM Computing Surveys - 2012.
[87] A.Ghiribaldi, A.Strano, M.Favalli, D.Bertozzi.
”Power Eﬃciency of Switch Architecture Extensions for Fault Tolerant
NoC Design.”
IGCC12, 2012, California, USA.
[88] O. Lysne, et Al.
”An eﬃcient and deadlock-free network reconﬁguration protocol”
IEEE Transactions on Computers, pp.762-779, 2008.
[89] A. Ghiribaldi, D. Ludovici, M. Favalli, D. Bertozzi.
”System-Level Infrastructure for Boot-time Testing and Conﬁguration of
Networks-on-Chip with Programmable Routing Logic”.
VLSI-SoC, 2011
[90] Jared C. Smolens, Brian T. Gold, James C. Hoe, Babak Falsaﬁ, and
Ken Mai
”Detecting Emerging Wearout Faults” SELSE, 2007.
[91] D.Wentzlaﬀ et al.
IEEE Micro, vol.27, no.5, pp.15-31, 2007.
[92] D. A. Ilitzky, J. D. Hoffman, A. Chun and B. P. Esparza
”Architecture of the Scalable Communications Core’s Network on Chip”.
IEEE Micro, vol.27, Issue 5, pp.62 - 74,2007.
[93] S.Vangal et al.,
”An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS”.
IEEE Journal of Solid-State Circuits, Vol.43, Issue 1, pp.29-41,2008.
[94] E. Flamand
”Strategic Directions Towards Multicore Application Speciﬁc Computing”.
Proc. of DATE, pp.1266, 2009.
[95] M. Krstic et al.
”Globally Asynchronous, Locally Synchronous Circuits: Overview and
Outlook”.
IEEE Design and Test of Computers, vol. 24, no. 5, pp. 430-441, 2007.
[96] S.Borkar
”Design Perspectives on 22nm CMOS and Beyond”
Proc. of DAC 2009.
[97] R.Dobkin, V.Vishnyakov, E.Friedman, R.Ginosar
”An Asynchronous Router for Multiple Service Levels Networks on Chip”
Proc. of ASYNC, pp.44-53, 2005.
[98] T.Bjerregaard, J.Sparso
”A Router Architecture for Connection-Oriented Service Guarantees in
the MANGO Clockless Network-on-Chip”
Proc. of DATE, pp.1226-1231, 2005.
[99] S. Stergiou, F. Angiolini, S. Carta, L. Raffo, D. Bertozzi, G. De Micheli
”xpipes Lite: a Synthesis Oriented Design Library for Networks on Chips”
Proc. of DATE, pp.1188-1193, 2005.
[100] Kakoee, M.R. and Loi, I. and Benini, L.
”A New Physical Routing Approach for Robust Bundled Signaling on NoC
Links”
Proc. of GLSVLSI, pp.3-8, 2010.
[101] D. Ludovici and A. Strano and G. N. Gaydadjiev and L. Benini and
D. Bertozzi
”Design Space Exploration of a Mesochronous Link for Cost-Effective and
Flexible GALS NOCs”
Proc. of DATE, pp.679-684, 2010.
[102] D. Ludovici, A. Strano, D. Bertozzi
”Architecture design principles for the integration of synchronization
interfaces into Network-on-Chip switches”
Proc. of NoCArc, pp.31-36, 2009.
[103] D. Ludovici, A. Strano, D. Bertozzi, L. Benini, G.N. Gaydadjiev
”Comparing tightly and loosely coupled mesochronous synchronizers in a
NoC switch architecture”
Proc. of NOCS, pp.244-249, 2009.
[104] A. Strano and D. Ludovici and D. Bertozzi
”A Library of Dual-Clock FIFOs for Cost-Effective and Flexible MPSoCs
Design”
Proc. of SAMOS, 2010.
[105] T.N.K.Jain
”Asynchronous Bypass Channels: Improving Performance for Multi-
Synchronous NoCs”
Proc. of NOCS, pp.51-58, 2010.
[106] F.Vitullo et al.
”Low-Complexity Link Microarchitecture for Mesochronous Communication
in Networks-on-Chip”
IEEE Trans. on Computers, Vol.57, issue 9, pp.1196-1201, 2008.
[107] D. Mangano, R. Locatelli, A. Scandurra, C. Pistritto, M. Coppola, L.
Fanucci, F. Vitullo, D. Zandri
”Skew Insensitive Physical Links for Network on Chip”
Proc. of NANO-NET, 2006.
[108] I.M.Panades, F.Clermidy, P.Vivet, A.Greiner
”Physical Implementation of the DSPIN Network-on-Chip in the FAUST
Architecture”
Proc. of NOCS, pp.139-148, 2008.
[109] F.Clermidy, R.Lemaire, X.Popon, D.Ktenas, Y.Thonnart
”An Open and Reconfigurable Platform for 4G Telecommunication:
Concepts and Application”
Proc. of DSD, pp.62-74, 2009.
[110] F.Clermidy, C.Bernard, R.Lemaire, J.Martin, I.Miro-Panades,
Y.Thonnart, P.Vivet, N.Wehn
”A 477mW NoC-based Digital Baseband for MIMO 4G SDR”
ISSCC 2010, pp.278-279, 2010.
[111] Y.Thonnart, P.Vivet, F.Clermidy
”A Fully-Asynchronous Low-Power Framework for GALS NoC Integration”
Proc. of DATE, pp.33-38, 2010.
[112] C. Batten, A. Joshi, V. Stojanovic, K. Asanovic.
”Designing chip-level nanophotonic interconnection networks.”
IEEE Journal on Emerging and Selected Topics in Circuits and Systems,
vol. 2, no. 2, pp. 137-153, 2012.
[113] S. Beamer, C. Sun, Y. Kwon, A. Joshi, C. Batten, V. Stojanovic, K.
Asanovic.
”Re-architecting DRAM memory systems with monolithically integrated
silicon photonics.”
ISCA 2010.
[114] J. Leu, V. Stojanovic.
”Injection-locked clock receiver for monolithic optical link in 45nm SOI.”
IEEE Asian Solid-State Circuits Conference (A-SSCC), 2011, pp. 149-152.
[115] Luca Ramini, Paolo Grani, Herve Tatenguem Fankem, A.Ghiribaldi,
S.Bartolini, D.Bertozzi
”Assessing the Energy Break-Even Point between an Optical NoC
Architecture and an Aggressive Electronic Baseline”
DATE 2014, Dresden, Germany.
