Reactive and Proactive Fault-Tolerant Network-on-Chip Architectures using Machine Learning by DiTomaso, Dominic F.
Reactive and Proactive Fault-Tolerant Network-on-Chip Architectures using Machine
Learning
A dissertation presented to
the faculty of
the Russ College of Engineering and Technology of Ohio University
In partial fulfillment
of the requirements for the degree
Doctor of Philosophy
Dominic F. DiTomaso
December 2015
© 2015 Dominic F. DiTomaso. All Rights Reserved.
This dissertation titled
Reactive and Proactive Fault-Tolerant Network-on-Chip Architectures using Machine
Learning
by
DOMINIC F. DITOMASO
has been approved for
the School of Electrical Engineering and Computer Science
and the Russ College of Engineering and Technology by
Avinash Kodi
Associate Professor of Electrical Engineering and Computer Science
Dennis Irwin
Dean of Russ College of Engineering and Technology
Abstract
DITOMASO, DOMINIC F., Ph.D., December 2015, Electrical Engineering and Computer
Science
Reactive and Proactive Fault-Tolerant Network-on-Chip Architectures using Machine
Learning (116 pp.)
Director of Dissertation: Avinash Kodi
Chip multiprocessors (CMPs) have emerged as the standard computer design to
overcome the high power limitations and high performance demands of modern processing.
Tens to thousands of cores at low frequencies (1-2 GHz) operate together to outperform
single core processors. In order for the cores to efficiently communicate, a communication
infrastructure called the network-on-chip (NoC) is required. The NoC uses modular router
and link components to route data across the chip. However, as transistor technology
scales down, more and more cores are being integrated into the NoC which leads to
power and performance concerns due to high buffering power and under-utilized links.
Moreover, the smaller transistors along with effects such as wear-out and device aging
lead to serious reliability concerns in the NoC. Commonly used reactive fault-tolerant
techniques, which are employed after the error has affected the system, can be most
effective against hard, or permanent, errors. Proactive fault-tolerant techniques, on the other
hand, can be used to prevent or avoid errors before they occur which can be most effective
against soft, or transient, errors. In this dissertation, two separate but related fault-tolerant
architectures are presented: 1) QORE - A reactive power-efficient/high performance fault-
tolerant architecture for hard errors and 2) A proactive prediction/mitigation fault-tolerant
architecture for soft errors. Both architectures provide fault-tolerance and both benefit from
machine learning (ML) techniques but in different ways.
QORE uses Multi-Function Channel (MFC) buffers and their associated control (link
and fault controllers) to provide fault-tolerance by allowing the NoC to dynamically
adapt to faults at the link level and reverse propagation direction to avoid faulty links.
Additionally, MFC buffers reduce router power and improve performance by eliminating
in-router buffering. An ML technique is used in the link controllers to predict the direction
of traffic flow in order to more efficiently reverse links. Simulation results using real
benchmarks and synthetic traffic mixes show that QORE improves speedup by 1.3× and
throughput by 2.3× when compared to state-of-the-art fault-tolerant NoC designs such
as Ariadne and Vicis. Moreover, results from the Synopsys Design Compiler show that
network power in QORE is reduced by 21% with minimal control overhead.
In the prediction/mitigation design, several effects such as process-voltage-temperature
variations and device wear-out are combined to create data sets which can be used in a
prediction model. ML techniques are used on the data sets to train a decision tree which can
be used to predict faults efficiently in the network. Based on the prediction model, the
predicted faults are dynamically mitigated through error correction codes (ECC) and relaxed
timing transmission. Results indicate that, on average, timing errors can be accurately
predicted 32.4% better than other labeling techniques, resulting in a 23.3% reduction in
retransmitted packets, a net speedup of 3.47×, and an energy savings of 41.9% over other
designs for real traffic patterns.
I dedicate my dissertation to my parents, sister, girlfriend, and all other family members
for their love and support.
Acknowledgments
I would like to first thank my advisor, Dr. Avinash Kodi, for his support, patience, and
hard work. I would like to thank my committee members for their time and input. Finally,
I thank my labmates for their help and for the entertainment they provided.
Table of Contents
Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
List of Acronyms
1 Introduction
1.1 Network-on-Chips
1.1.1 Router Microarchitecture
1.1.2 Router Pipeline
1.2 Power in NoCs
1.3 Performance in NoCs
1.4 Reliability in NoCs
1.5 Proactive Fault-Tolerant Techniques and Machine Learning
1.6 Major Contributions
1.7 Organization of Dissertation
2 QORE: A Fault Tolerant NoC Architecture
2.1 Multi-Function Channel Buffers
2.2 MFC without Faults
2.3 MFC with Faults
2.4 Link and Fault Controllers
2.5 Improving Link Controllers with ML
2.6 Router Architecture
2.7 Deadlock Avoidance and Reliability Concerns
3 Prediction and Mitigation of Soft Errors in NoCs
3.1 Design of Data Sets
3.1.1 Fault Model
3.1.2 Data Sets
3.2 Machine Learning
3.3 Error Mitigation and Router Microarchitecture
4 Evaluation
4.1 QORE Results
4.1.1 Power, Area, and Timing Overhead
4.1.2 Speedup on Real Applications
4.1.3 Network Throughput
4.1.4 Packet Latency
4.1.5 Network Power
4.1.6 Network Performance of Reversibility
4.1.7 Sensitivity Study: Varying Number of Links
4.1.8 Accuracy of Decision Trees
4.2 Soft Error Prediction with Mitigation Results
4.2.1 Overhead of Error Mitigation and Predictors
4.2.2 Trained Decision Trees
4.2.3 Performance of Decision Trees
4.2.4 Network Performance
5 Conclusions and Future Work
References
List of Tables
1.1 Work related to buffer power optimizations.
1.2 Work related to crossbar (xbar) power optimizations.
3.1 Model application link utilization and temperature pattern [1].
3.2 Latency overheads of each mitigation technique given the number of timing errors [1].
3.3 Energy overheads of each mitigation technique given the number of timing errors [1].
4.1 Cache and core parameters used for Splash-2, PARSEC, and SPEC2006 application suite simulation [2] ©2014 IEEE [3] ©2015 IEEE.
4.2 Power overhead for the components of one router [2] ©2014 IEEE [3] ©2015 IEEE.
4.3 Area overhead for the components of one router [2] ©2014 IEEE [3] ©2015 IEEE.
4.4 Breakdown of traffic mixes [2] ©2014 IEEE [3] ©2015 IEEE.
4.5 Accuracy of four different decision trees [3] ©2015 IEEE.
4.6 Overhead of router components [1].
4.7 Error Mitigation and Predictor Overheads [1].
4.8 Confusion matrix for +x link of router 0 [1].
4.9 Confusion matrix averaged over all decision trees [1].
4.10 Accuracy, F-score, and per-hop retransmit percent due to timing errors for our decision trees (DTs) compared to other labeling techniques [1].
List of Figures
1.1 Power consumption and clock rate of different generation Intel processors over the last 30 years [4].
1.2 Examples of different topologies: (a) Concentrated mesh topology, (b) Flattened Butterfly topology, (c) mesh NoC, and (d) router details.
1.3 (a) Dynamic and (b) Leakage power of router crossbar, conventional buffers, and channel buffers [2] ©2014 IEEE [3] ©2015 IEEE.
1.4 Link utilization of a router on two multicore applications from the SPLASH-2 and Parsec benchmark suites [2] ©2014 IEEE [3] ©2015 IEEE.
1.5 Transistors scaling along with device wear-out, device aging, etc. can lead to faults in NoCs. Modern NoCs must have mechanisms in place to overcome such faults.
1.6 HCI effect during switching of both nMOSFET and pMOSFET and NBTI effect on pMOSFET for (a) high switching activity, low duty cycle and (b) low switching activity, high duty cycle (Figure from [5]).
1.7 Link failures with reversible channel buffers, with conventional channel buffers, and without channel buffers [2] ©2014 IEEE [3] ©2015 IEEE.
1.8 Other research related to fault tolerant NoCs: (a) Ariadne [6] and (b) Vicis [7].
2.1 (a) Conventional channel buffer, (b) our reversible channel buffer, and (c) storage and propagation for both forward and backward links [2] ©2014 IEEE [3] ©2015 IEEE.
2.2 State diagram for MFC control block.
2.3 Discharge time of channel buffer using 130 nm transistors in the Virtuoso Analog Design Environment from the Cadence tools [2] ©2014.
2.4 QORE’s four reversible router links each consisting of two channel buffer lines [2] ©2014 IEEE [3] ©2015 IEEE.
2.5 Layout of QORE showing links configured to an arbitrary traffic pattern [2] ©2014 IEEE [3] ©2015 IEEE.
2.6 Link direction naming convention and link status tables which are located in the router and used to store statistics on each local link [2] ©2014.
2.7 (a) Block diagrams of link controller (LC) and fault controller (FC) and (b) example of fault adaptability for a various number of faults [2] ©2014.
2.8 An example decision tree [3] ©2015 IEEE.
2.9 List of possible features for predicting traffic on the +x links of router 0 [3] ©2015 IEEE.
2.10 Router Microarchitecture showing inputs/outputs, LC, FC, and RC [2] ©2014 IEEE [3] ©2015 IEEE.
3.1 Our process to create features and labels from raw data collected by combining several fault models [1].
3.2 Example of a generic decision tree with outcomes that determine the predicted error type [1].
3.3 Example decision tree built using the ID3 algorithm.
3.4 Steps to build the example decision tree. Step 1: Feature with the highest information gain becomes the root of the tree. Step 2a: Data set is partitioned on the feature values of the root. Step 2b: Root feature is removed from the feature list. Step 3: Recursively call ID3 on each newly partitioned data set. Repeat all steps until the tree is completed [1].
3.5 Examples of CRC, SECDED, and relax transmission mitigation techniques [1].
3.6 Router microarchitecture showing the feature table, predictors, encoders, decoders, and CRC blocks [1].
3.7 Example of packet transmission with and without prediction [1].
4.1 Speedup relative to Vicis with varying number of faults where BAR (B), QORE (Q), Ariadne (A) relative to Vicis (V) for 64 cores on (a) FFM and FFT apps and (b) bzip and freqmine apps [2] ©2014 IEEE [3] ©2015 IEEE.
4.2 Speedup relative to Vicis with varying number of faults where BAR (B), QORE (Q), Ariadne (A) relative to Vicis (V) for 64 cores on (a) LU and Ocean apps and (b) streamcluster and swaptions apps [2] ©2014 IEEE [3] ©2015 IEEE.
4.3 Average number of subnetworks of QORE compared to Ariadne and Vicis [2] ©2014 IEEE.
4.4 Saturation throughput for varying percentage of link failures for different traffic mixes for BAR (B), QORE (Q), Ariadne (A) and Vicis (V) [2] ©2014 IEEE [3] ©2015 IEEE.
4.5 Latency plots for traffic mix 1 for 0-20% faults [2] ©2014 IEEE [3] ©2015 IEEE.
4.6 Latency plots for traffic mix 1 for 30-50% faults [2] ©2014 IEEE [3] ©2015 IEEE.
4.7 Network power for different traffic mixes for BAR (B), QORE (Q), Ariadne (A) and Vicis (V) [2] ©2014 IEEE [3] ©2015 IEEE.
4.8 Buffer power for traffic mix 1 [3] ©2015 IEEE.
4.9 Effect Rw on saturation throughput for varying traffic mixes [2] ©2014 IEEE.
4.10 Saturation throughput for varying percentage of link failures for different traffic mixes for QORE of N values: 2 links (Q2), 4 links (Q4), and 8 links (Q8) [3] ©2015 IEEE.
4.11 (a) Decision tree for the +x links of router 0 and (b) decision tree for the +y links of router 1 [3] ©2015 IEEE.
4.12 Concentrated mesh with network parameters [1].
4.13 Decision trees built using the ID3 algorithm for +x link of router 0 [1].
4.14 Decision trees built using the ID3 algorithm for +y link of router 0 and +x link of router 1 [1].
4.15 Percent of packets that require full retransmission [1].
4.16 Application speedup relative to M-always [1].
4.17 Breakdown of energy per flit for different design [1].
List of Acronyms
ANN - Artificial Neural Networks
BAR - Bandwidth Adaptive Router
BFLY - Butterfly
BIST - Built-in Self Test
BR - Bit Reversal
BW - Buffer Write
CMesh - Concentrated Mesh
CMP - Chip Multi-Processor
COMP - Complement
CRC - Cyclic Redundancy Check
DT - Decision Trees
ECB - Elastic Buffering
ECC - Error Correction Codes
EM - Electromigration
ET - Error Type
FBfly - Flattened Butterfly
FC - Fault Controller
HCI - Hot Carrier Injection
HPC - High Performance Computing
I/O - Input/Output
ILP - Instruction Level Parallelism
LC - Link Controller
LST - Link Status Table
MFC - Multi-Function Channel Buffer
ML - Machine Learning
MT - Matrix Transpose
NBTI - Negative-bias Temperature Instability
NoC - Network-on-Chip
NUR - Non-Uniform Random
PS - Perfect Shuffle
RC - Route Computation
Rel - Release
Rev - Reverse
RT - Relaxed Transmission
SA - Switch Allocation
SECDED - Single Error Correction, Double Error Detection
SIMD - Single Instruction Multiple Data
ST - Switch Traversal
UR - Uniform Random
VA - Virtual Channel Allocation
VC - Virtual Channel
1 Introduction¹
The processing core is an essential component of any computing system. By
communicating with memory, caches, and input/output (I/O) devices, a processing core
can execute a variety of applications in areas such as particle physics, drug discovery, and
genetics [8–11]. The increasingly large-scale input data and high computational complexity
of these applications create a demand for high performance computing (HPC) systems.
However, one of the main limitations to HPC systems is the high power consumption across
the small chip [4, 8, 12–14]. Figure 1.1 shows the power consumption and clock rates of
several Intel processors over the past three decades. Initially, clock rates were increased
to improve performance of the processor. As power is directly related to frequency, the
power also increased with clock rates. The Pentium 4 processor greatly increased the clock
rate as well as power; however, there were diminishing returns on performance. In 2004,
clock rates peaked at 3.6 GHz; however, the thermal issues of the Prescott led to the Core
2 which uses multiple processors and lower clock rates. Since 2007, processor clock
speeds have been constrained to approximately 2-3 GHz and other efforts have been made
in order to reduce power consumption [4, 15, 16]. To compensate for the slower clock
speeds, modern computing systems apply parallelism.
In order to improve the performance of a processing core, computer architects
have employed parallel techniques such as instruction level parallelism (ILP), single
instruction multiple data (SIMD), and multithreading [4, 16–18]. ILP executes independent
instructions simultaneously to improve performance by using techniques such as processor
pipelining. However, ILP is limited by the number of independent instructions in the
program. SIMD is another technique commonly applied when performing the same
operation on many data points. SIMD groups multiple operations together so that data
can be loaded and stored in blocks, resulting in lower energy, latency, and complexity.
Programs can be split into multiple threads which can be independently executed to further
take advantage of ILP. Multithreading is beneficial in that memory latencies can be hidden.
These techniques show the usefulness of parallelism in computing systems.
¹ Some material in this dissertation was used verbatim from my publications [2, 3] with permission
©2014, 2015 IEEE and a submitted publication awaiting decision [1].
Figure 1.1: Power consumption and clock rate of different generation Intel processors over
the last 30 years [4].
However, in order to continue performance improvements and to limit the excessive
power consumption of the integrated circuit, or chip, computer architects have shifted
to chip multiprocessor (CMP) designs [19, 20]. CMPs are architectures which integrate
multiple processing cores on a single chip. Each processing core executes at a lower
frequency [13]; however, the multiple processing cores work together to improve
application execution times. With continued improvements in silicon technology,
transistor sizes continue to decrease, which enables hundreds to thousands of cores to be
integrated on a single chip. Some examples of CMPs are Intel’s 48-core SCC [21], Tilera’s
72-core TILE-Gx [22], Kalray’s 256-core MPPA [23], and NVIDIA’s 512-core Fermi chip
[24]. The network-on-chip (NoC) is the communication infrastructure which allows for
communication between all of the processing cores in a CMP.
1.1 Network-on-Chips
In a NoC, nodes, which can be processing cores or caches, use a series of routers
connected via links to send information. These routers and links can be configured in
various topologies. Figure 1.2(a) shows one common topology called a concentrated mesh
(CMesh) [25]. CMesh connects routers in a grid-like, or mesh, topology and each router is
connected, or concentrated, to four cores. A CMesh router can be connected to a varying number
of cores, but four is the most common as this number balances performance with router
size. A CMesh with a concentration of one is simply known as a mesh topology. Higher
concentration results in more contention for router resources but higher concentration also
results in lower latency due to a reduction in average hop count. The hop count is defined as
the number of link traversals required for the data to reach its destination. Another topology is
the Flattened Butterfly (FBfly) [26] which is shown in Figure 1.2(b). FBfly fully connects
routers in the same row and routers in the same column. This increases the size of the
router but reduces the maximum hop count, called network diameter, to two hops. FBfly is
a low latency network but scales poorly. Figure 1.2(c) shows an example mesh NoC with
nine routers. Information is sent in the form of data packets from one router to another
until the destination is reached. NoCs have been shown to improve CMP performance over
conventional bus-based networks [25, 27, 28].
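The hop-count comparison above can be made concrete with a small sketch. It assumes minimal routing, so the hop count in a mesh is the Manhattan distance between routers, and it counts router-to-router hops only (core-to-router links are ignored). The function names and the 64-core sizing are illustrative and not taken from the dissertation.

```python
from itertools import product

def mesh_hops(src, dst):
    """Minimal hop count between two routers in a 2D mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def fbfly_hops(src, dst):
    """Hop count in a 2D flattened butterfly: one hop per dimension that differs,
    because routers in the same row or column are fully connected."""
    return int(src[0] != dst[0]) + int(src[1] != dst[1])

def average_hops(k, hops):
    """Average hop count over all distinct router pairs in a k x k grid."""
    routers = list(product(range(k), repeat=2))
    pairs = [(s, d) for s in routers for d in routers if s != d]
    return sum(hops(s, d) for s, d in pairs) / len(pairs)

def diameter(k, hops):
    """Maximum hop count (network diameter) in a k x k grid."""
    routers = list(product(range(k), repeat=2))
    return max(hops(s, d) for s in routers for d in routers)

# 64 cores: an 8x8 mesh (concentration 1) versus a 4x4 CMesh (concentration 4).
mesh_avg = average_hops(8, mesh_hops)
cmesh_avg = average_hops(4, mesh_hops)
fbfly_diameter = diameter(4, fbfly_hops)
print(f"mesh avg hops: {mesh_avg:.2f}, CMesh avg hops: {cmesh_avg:.2f}, "
      f"FBfly diameter: {fbfly_diameter}")
```

Under these assumptions the concentrated network roughly halves the average hop count for the same number of cores, while the flattened butterfly never needs more than two hops.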
1.1.1 Router Microarchitecture
The router is an essential piece of the NoC as well as the most complicated. The
purpose of the router is to store, route, and switch data from an input port to an output
port [18]. Routers operate on units of information called flits, which are subdivisions of data
packets. Figure 1.2(d) shows the details of the router microarchitecture. A router consists
of a number of input ports and output ports depending on the topology and number of
cores/caches connected to the router. Flits entering a router are stored in a buffer while
they wait for arbitration. Typically, routers will use multiple, parallel buffers called virtual
channels (VCs) to avoid deadlocks and improve performance [29]. A crossbar is a switch
that fully connects input buffers to output ports. The router connected to the desired output
port is called the downstream router. Credits are typically used to keep track of the number
of free buffers at the downstream router so that the buffers do not overflow. The control
blocks, shown as teal rectangles in Figure 1.2(d), control which flits can use the crossbar
and when they can use the crossbar. Flits use these control blocks to move through the
network in a pipeline fashion.
Figure 1.2: Examples of different topologies: (a) Concentrated mesh topology, (b)
Flattened Butterfly topology, (c) mesh NoC, and (d) router details.
[Figure labels: processing core, router, and link; routers numbered 0-8 in (c); in (d), inputs In0 to Inn with VCs, outputs Out0 to Outm, credits in/out, the RC, VA, and SA control blocks, and an n x m crossbar.]
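The credit mechanism described above can be sketched as follows. This is a simplified illustration, not the dissertation's implementation: the class name `CreditLink` is hypothetical, and credits are assumed to return to the upstream router instantly, whereas a real router would see a credit round-trip delay.

```python
from collections import deque

class CreditLink:
    """Upstream router's view of one downstream input buffer, tracked with credits."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # free slots believed to exist downstream
        self.downstream = deque()     # flits currently held in the downstream buffer

    def can_send(self):
        """A flit may be sent only while at least one credit remains."""
        return self.credits > 0

    def send(self, flit):
        """Consume one credit and place the flit in the downstream buffer."""
        if not self.can_send():
            raise RuntimeError("no credit: downstream buffer would overflow")
        self.credits -= 1
        self.downstream.append(flit)

    def drain_one(self):
        """Downstream router forwards one flit, freeing a slot and returning a credit."""
        flit = self.downstream.popleft()
        self.credits += 1
        return flit

link = CreditLink(buffer_slots=2)
link.send("flit0")
link.send("flit1")
print(link.can_send())   # the downstream buffer is full, so sending must stall
link.drain_one()
print(link.can_send())   # a credit came back, so sending may resume
```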
1.1.2 Router Pipeline
There are typically five stages in a router pipeline. The first stage is buffer write (BW)
in which a flit enters the router and is written to a VC. The next stage is route computation
(RC). In this stage, control information is read from the flit to determine which output port
the flit needs to be switched to. Next, a VC at the downstream router is allocated for a whole
packet in the virtual channel allocation (VA) stage. Once a packet allocates a downstream
VC, it can contend for use of the crossbar in the switch allocation (SA) stage. The SA
control block fairly selects one flit from each input port to contend for an output port. If
multiple flits contend for the same output port, the SA block selects one winner and the
losers can contend during the next clock cycle. Finally, flits which win arbitration traverse
the switch in the switch traversal (ST) stage.
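As an illustration of the SA stage, the sketch below grants each requested output port to one input port per cycle. The text only says the selection is fair; the round-robin priority pointer used here is one common way to achieve that and is an assumption, as are the function and parameter names.

```python
def switch_allocate(requests, rr_pointer, num_inputs):
    """Grant each requested output port to at most one input port.

    requests:   dict mapping input port -> desired output port (one flit per input).
    rr_pointer: dict mapping output port -> input index with highest priority.
    Returns (grants, new_pointer), where grants maps output port -> winning input.
    """
    grants = {}
    new_pointer = dict(rr_pointer)
    for out in sorted(set(requests.values())):
        start = rr_pointer.get(out, 0)
        # Scan inputs starting at the priority pointer and pick the first requester.
        for offset in range(num_inputs):
            candidate = (start + offset) % num_inputs
            if requests.get(candidate) == out:
                grants[out] = candidate
                # The winner gets lowest priority next time, which keeps the arbiter fair.
                new_pointer[out] = (candidate + 1) % num_inputs
                break
    return grants, new_pointer

# Inputs 0 and 1 both want output 2; input 3 wants output 0.
grants, pointer = switch_allocate({0: 2, 1: 2, 3: 0}, {}, num_inputs=5)
# The loser (input 1) contends again in the next cycle and now wins.
retry_grants, _ = switch_allocate({1: 2}, pointer, num_inputs=5)
print(grants, retry_grants)
```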
Some advanced techniques can be implemented to reduce the length of the router
pipeline. Look-ahead routing [30] can reduce the router pipeline to four stages. By sending
routing information ahead of the flit to the downstream router, the BW and RC stages can be
done in parallel which will reduce the router pipeline. Additionally, speculation techniques
[31–33] can also be implemented to reduce the router pipeline to three stages. Speculation
assumes that the switch will be available so that the VA and SA stages can be performed
in parallel. Speculation is effective under low traffic loads but can be ineffective at high
traffic loads due to additional delays incurred when speculation is incorrect and the switch
is unavailable.
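The effect of these optimizations on pipeline depth can be sketched as below. The model treats stages performed in parallel as a single step and estimates a zero-load latency assuming one cycle per link traversal and no contention; both the one-cycle link and the helper names are assumptions made for illustration.

```python
def pipeline_steps(lookahead=False, speculation=False):
    """Return the router pipeline as a list of steps; parallel stages share a step."""
    steps = [["BW"], ["RC"], ["VA"], ["SA"], ["ST"]]
    if lookahead:
        # Look-ahead routing: BW and RC are performed in parallel.
        steps = [["BW", "RC"]] + steps[2:]
    if speculation:
        # Speculation: VA and SA are performed in parallel.
        merged = []
        for step in steps:
            if step == ["SA"] and merged and merged[-1] == ["VA"]:
                merged[-1] = ["VA", "SA"]
            else:
                merged.append(step)
        steps = merged
    return steps

def zero_load_latency(hops, router_cycles, link_cycles=1):
    """Cycles for a flit to cross `hops` routers and links with no contention."""
    return hops * (router_cycles + link_cycles)

baseline = len(pipeline_steps())
with_lookahead = len(pipeline_steps(lookahead=True))
with_both = len(pipeline_steps(lookahead=True, speculation=True))
print(baseline, with_lookahead, with_both)
print(zero_load_latency(4, baseline), zero_load_latency(4, with_both))
```

The model ignores the penalty paid when speculation fails, which is why it only approximates behavior under low traffic loads.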
1.2 Power in NoCs
In NoCs, segments of links are connected via routers in order to overcome global wire
delays and scalability requirements. However, the combination of links and routers incur
a power and area expense which adversely affects NoC performance [2, 3]. Research has
shown that router buffers account for 30% of router area [34] and dissipate 46% of router
power [35]. Extensive power optimization techniques have been used to mitigate the NoC
power consumption [2, 3]. For example, the power consumption of the network in Intel’s
recent 48-core SCC [21] design, which uses regular cores, is reduced from 28% to 10%
of total system power compared to Intel’s older 80-core TeraFlops chip [36] which uses
simpler cores. Two types of power dissipation contribute to the overall power consumption
of a chip: dynamic and static. Dynamic power is dissipated when a transistor switches a
bit from a logic 1 to a logic 0 and vice versa. Static power is dissipated due to the leakage
current of a transistor even when it is off. In servers, static power contributes to 40% of the
overall power consumption [4]. Power optimizations of the NoC fabric are a critical piece
of the puzzle to sustain and continue the drastic growth in CMP performance [2, 3].
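The two power components can be illustrated with the standard first-order CMOS model, in which dynamic power is the product of activity factor, switched capacitance, supply voltage squared, and frequency, and static power is supply voltage times leakage current. This model is not stated in the text above, and the numeric values below are placeholders chosen only to show the scaling behavior.

```python
def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """First-order dynamic power: activity * C * Vdd^2 * f (watts)."""
    return activity * capacitance_f * vdd ** 2 * freq_hz

def static_power(vdd, leakage_current_a):
    """First-order static power: Vdd * I_leak (watts), independent of switching."""
    return vdd * leakage_current_a

# Placeholder values: 1 nF switched capacitance, 1.0 V supply, 50% activity.
p_2ghz = dynamic_power(0.5, 1e-9, 1.0, 2e9)
p_4ghz = dynamic_power(0.5, 1e-9, 1.0, 4e9)
p_leak = static_power(1.0, 0.2)

print(f"dynamic at 2 GHz: {p_2ghz:.2f} W, at 4 GHz: {p_4ghz:.2f} W")
print(f"static: {p_leak:.2f} W regardless of clock rate")
```

In this model, doubling the clock rate doubles dynamic power, while leakage remains even when nothing switches.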
Buffers consume significant dynamic power when traffic load is high as well as static
power due to leakage [2, 3]. In order to reduce this high power consumption, some
designs have moved buffers from the router to the channels by replacing repeaters on the
channel with tri-state repeaters called channel buffers [2, 37–40]. These channel buffers
can store packets when the register buffers are congested or propagate data forward when
necessary, thereby mitigating power and area penalties associated with router buffers [2, 3].
Figure 1.3(a) and (b) show the total power breakdown for a 5x5 router from
Synopsys Design Compiler using the TSMC-LPBWP 40 nm technology library with a
nominal supply voltage of 1.0 V and an operating frequency of 2 GHz. The dynamic
power breakdown (in mW) in Figure 1.3(a) shows that buffers consume 33% of router power
(buffers+crossbar). With the same amount of buffer space, channel buffers can lower
dynamic power by 90%. Figure 1.3(b) shows the leakage power breakdown of the router
components in µW. As shown, the leakage power of buffers is 68% of the total router
leakage power (buffers+crossbar). Channel buffers dissipate more leakage power than the
register buffers; however, this increase can be compensated by the very low dynamic power
[2, 3].
Figure 1.3: (a) Dynamic and (b) Leakage power of router crossbar, conventional buffers,
and channel buffers [2] ©2014 IEEE [3] ©2015 IEEE.
[Bar charts: crossbar, buffer, and channel buffer; y-axis is power in mW for (a) and in µW for (b).]
This high dynamic and static power consumption in NoCs has motivated architects to
implement buffer optimization techniques such as elastic buffering (ECB) [37, 38, 41] and
bufferless routing [42, 43]. ECBs store data on the link similar to channel buffers; however,
ECBs replace link repeaters with flip-flops which are used for storage and propagation.
By completely eliminating buffers and implementing bufferless routing,
other recent work has reduced the average network energy by 40% [42]. In addition to buffer
optimizations, there has also been some research into optimizing crossbars for power.
These designs separate the big, unified crossbar into several smaller, separate crossbars
[39].
Table 1.1: Work related to buffer power optimizations.
Buffer Design | Description | Advantages | Challenges
iDEAL [37] | Tri-state link buffers | Reduced power and area | No HoL avoidance
ECB [38] | Flip-flop link buffers | HoL avoidance | Performance limitations
FlitBLESS [42] & SCARAB [43] | Bufferless - deflects/drops pkts | Reduced power and area | High-speed route computation logic
4S, 2S, and 1S [39] | Multiple channel tri-state link buffers | Reduces HoL blocking, power, area, & perform. | Diff. power, area, & perform. trade-offs

Table 1.2: Work related to crossbar (xbar) power optimizations.
Crossbar Design | Description | Advantages | Challenges
RoCo [44] | Row/Column xbars | Small 2x2 xbars | Restricted routing
Mora et al. [45] | Bisects output ports | Two separate xbars | Focuses on high radix xbars
1XB, 2XB, and 4XB [39] | Trans. gates or separate xbars | 1XB - Performance; 2XB/4XB - Power & area | 1XB - Area; 2XB/4XB - Routing
Table 1.1 summarizes the related buffer power optimization work. The iDEAL, 4S, 2S,
and 1S designs use channel buffers via repeater optimization whereas ECB uses flip-flop link buffers.
FlitBLESS and SCARAB are bufferless designs. Table 1.2 summarizes the related crossbar
power optimization work. RoCo, the work of Mora et al., and the 1XB, 2XB, and 4XB designs are
crossbar optimizations, each with its own unique advantages and challenges.
1.3 Performance in NoCs
In addition to power, with the increasing number of cores, NoCs must manage the
communication demands. Since NoCs are designed to handle peak traffic loads, many
communication channels can go under-utilized when network load is low or the workload
is unbalanced. Examining a NoC router, the amount of traffic entering and leaving the
router will be similar when averaged across the whole application. However, due to
dynamic traffic patterns in NoC applications, there will be periods of time where the majority
of traffic will be either entering or leaving the router. This unbalanced traffic can cause
certain links to become under-utilized during certain epochs. Figure 1.4 shows the link
utilization of a router in a 64 core network for two real applications. Each side of the
router (+x, -x, +y, and -y) has a link going in and out [2, 3]. We examine two multicore
applications: FMM from the SPLASH-2 benchmark suite [9] and blackscholes from the
parsec benchmark suite [10]. For both FMM and blackscholes, many links are under-
utilized, thereby, wasting bandwidth. For example, on the +x side of the router, the “in”
channel utilization is approximately double the “out” channel utilization. On other links,
the “in” utilization is much lower than the “out” utilization. Using reversibility, links can
change direction providing bandwidth where needed. Channel buffers can reduce dynamic
power while marginally increasing leakage power; and reversible channel could maximize
resource utilization and improve execution time, but would need fault tolerant techniques
to overcome the higher fault rates observed in channel buffers [2, 3].
Recent research on NoC performance has tackled the above-mentioned problems using
techniques such as reversibility or coding schemes [46–51]. Hesse et al. propose a
bandwidth-adaptive router (BAR) that aims to take advantage of these under-utilized links
with bidirectional, adaptive channels [47]. These bidirectional channels adapt channel
bandwidth at a fine granularity according to network traffic demands [2, 3]. BAR is not a
fault-tolerant network, but it increases channel utilization by using narrower channels while
also improving performance through adaptive bidirectional channels. Research has shown
that channel reversibility can achieve higher throughput and lower average packet latency
in NoCs [2, 3].
Figure 1.4: Link utilization of a router on two multicore applications from the SPLASH-2
and PARSEC benchmark suites [2] ©2014 IEEE [3] ©2015 IEEE.
[Bar chart: channel utilization (%) of the “in” and “out” links on the +x, -x, +y, and -y
sides of the router, for FMM and blackscholes.]
1.4 Reliability in NoCs
In addition to power and performance, fault tolerance is another concern in NoCs [2, 3].
The scaling of transistors has enabled the integration of billions of transistors on a chip.
However, future CMPs are in jeopardy due to the reliability of these transistors. As shown
in Figure 1.5, the aggressive scaling of transistors will lead to an increase in faults in the
NoC. Faults can occur on the links or in any router component. It is critical to understand
the causes of faults so that the NoC can effectively handle errors.
Figure 1.5: Transistor scaling along with device wear-out, device aging, etc. can lead to
faults in NoCs. Modern NoCs must have mechanisms in place to overcome such faults.
[Diagram: a 3x3 mesh of routers (0-8) connected by links, with a fault in the network;
packets must find a fault-free path.]
There are two types of faults: permanent faults, or hard faults, and transient faults,
or soft faults. Devices that have permanent faults will always output unreliable data
throughout the lifetime of the chip. Many permanent faults are a result of (i) severe
device wear-out (hot carrier injection (HCI), negative-bias temperature instability (NBTI),
and electromigration (EM)), (ii) transistor infant mortality (early transistor failing or
accelerated aging), and (iii) manufacturing or variation-induced defects (optical proximity
effects, airborne impurities, processing defects) that escape initial testing. HCI occurs when
carriers have enough energy to be injected into the gate oxide of the transistor [5, 52]. This
results in gradual degradation of transistor parameters such as switching frequency, which
can eventually reduce the lifetime of the device [5, 52]. HCI affects both nMOSFETs
and pMOSFETs and can occur each time a transistor switches, as shown in Figure 1.6.
Therefore, the more a transistor is switched, the shorter its lifetime. NBTI, on the other
hand, occurs only in the pMOSFET while there is a “0” voltage on the gate, as shown in
Figure 1.6. However, similar to HCI, NBTI can also cause gradual parameter shifts of the
transistor. NBTI reduces carrier mobility, resulting in less current to drive the transistor
and a higher threshold voltage, which can result in unreliable output [5]. As shown in
Figure 1.6, longer duty cycles increase the effects of NBTI and result in a shorter device
lifetime. Finally, EM is the result of the momentum of electrons in the interconnects being
transferred to the ions in the surrounding material. This can eventually lead to open circuits
over time, which results in shorter lifetimes of devices.
Figure 1.6: HCI effect during switching of both nMOSFET and pMOSFET and NBTI effect
on pMOSFET for (a) high switching activity, low duty cycle and (b) low switching activity,
high duty cycle (Figure from [5]).
On the other hand, devices that have transient faults are more difficult to handle since
the device may operate error-free sometimes but can cause errors at other times. Transient
faults can be caused by stringent timing constraints, voltage spikes, electrostatic discharge,
thermal noise, crosstalk, and electromagnetic interference. Device wear-out can also cause
transient faults because as the device becomes worn-out and slower, the probability of
timing errors increases. Faults in NoCs may not completely cripple the CMP but can cause
loss of packet data, misrouting, deadlocks, misallocations, and network distributions. Both
permanent and transient faults can lead to increased execution time, excessive delays and
increased power consumption while recovering from the fault [44]. Therefore, it is critical
that NoCs implement fault tolerant techniques to address the reliability concerns of future
CMPs.
The extreme shrinking of transistor feature sizes has made NoCs vulnerable to failures
and data corruption [2, 3]. To examine the number of link faults in a NoC, a fault model was
used which was similar to the model used in [6] in which a router design consists of 20,413
gates. Faults were injected randomly and weighted by the size of the gates. Therefore,
gates with a larger number of transistors have a higher probability of failing. Figure 1.7
shows the number of faulty links caused by gate failures for reversible channel buffers,
non-reversible channel buffers, and conventional links without channel buffers [2, 3]. Non-
reversible channel buffers were briefly described in Section 1.2 and will be described in
more detail, along with reversible channel buffers, in Section 2.1. Non-reversible channel
buffers are less reliable than conventional links due to the extra two transistors added to
each link. Reversible channel buffers are even less reliable because of eight additional
transistors. Therefore, robust fault tolerant techniques are even more critical when using
channel buffers [2, 3].
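The size-weighted fault injection described above can be illustrated with a minimal sketch. It is not the fault model of [6]; the gate names, transistor counts, and the choice to sample without replacement are assumptions made only for illustration.

```python
import random

def inject_gate_faults(gate_sizes, num_faults, seed=None):
    """Pick distinct gates to fail, weighted by transistor count.

    gate_sizes maps a (hypothetical) gate name to its number of transistors.
    Sampling without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    remaining = dict(gate_sizes)
    failed = []
    for _ in range(min(num_faults, len(remaining))):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        # Larger gates are proportionally more likely to be chosen.
        victim = rng.choices(names, weights=weights, k=1)[0]
        failed.append(victim)
        del remaining[victim]
    return failed

# Hypothetical gate inventory: name -> number of transistors.
gates = {"inv_0": 2, "nand2_0": 4, "nor3_0": 6, "xor2_0": 8}
print(inject_gate_faults(gates, 2, seed=1))
```

Mapping each failed gate to the link it belongs to would then give a count of faulty links of the kind plotted in Figure 1.7.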
Figure 1.7: Link failures with reversible channel buffers, with conventional channel buffers,
and without channel buffers [2] ©2014 IEEE [3] ©2015 IEEE.
[Line plot: number of link faults (0-50) versus gate failures (0-100) for reversible channel
buffers, channel buffers, and conventional links.]
Recently, researchers have been tackling channel failure in NoCs [6, 7, 53–58]. Built-In
Self Tests (BISTs) are commonly used to detect errors in systems [2, 3]. Recently,
NoCAlert [53] was proposed, which detects faults in real time with 0% false negatives.
Low-overhead checkers are used to detect faults without the need for periodic or
trigger-based testing [2, 3]. Orthogonal to error detection, there has been some research into
fault-tolerant architectures. The Ariadne [6] network, shown in Figure 1.8(a), uses up*/down*
routing to move around faults. Each time a fault is detected, new routing paths are
created by transmitting a series of flag broadcasts to all routers. This creates a
deadlock-free tree network for the irregular topology [2, 3]. The Vicis [7] network, shown in
Figure 1.8(b), also changes its routing algorithm to move around faults when detected. To
avoid deadlocks, turn restrictions are placed at certain routers [2, 3]. The BulletProof
architecture [59] concentrates on the router (not the channels) and provides efficient fault
tolerance schemes for routers to overcome transient and permanent faults. The Immunet [54]
design avoids faults by adaptively routing packets while using escape VCs to avoid deadlocks.
1.5 Proactive Fault-Tolerant Techniques and Machine Learning
Typical techniques for handling faults are often reactive, which implies that they respond
to faults after the error has already occurred. Reactive fault handling techniques are not
optimal because they are employed after errors have already affected the
performance of the system. Some reactive techniques route around faults after they have
occurred [6, 7, 44, 54]; some use reconfigurable links to avoid faults [2, 60, 61]; while
others detect all errors after they occur [53, 62]. On the other hand, proactive techniques
prevent faults before they occur or reduce the probability of a fault affecting the packet.
Proactive fault-tolerant schemes can be more beneficial because they are employed before
the error affects the system. Some proactive techniques include load balancing routers
and links to prevent device wear-out [5], re-routing in dynamic voltage scaling systems
to avoid low-voltage routers [63], and prediction of timing-critical instructions [64] or
program phases [65]. Unlike previous fault-tolerant designs, we use machine learning (ML)
to precisely pinpoint faults through a statistical approach.
Figure 1.8: Other research related to fault tolerant NoCs: (a) Ariadne [6] and (b) Vicis [7].
ML is the construction of programs which make predictions and improve with
experience, or learn. ML can be broadly categorized into three types: supervised learning,
unsupervised learning, and reinforcement learning. In supervised learning, programs learn
from labeled examples. In contrast, unsupervised learning programs learn from unlabeled
examples. Lastly, reinforcement learning programs learn with delayed feedback. In this
dissertation, the focus is supervised learning. The task of supervised learning is to learn
a function which maps input instances to output targets. This can be done in two ways:
classification and regression. In classification, the output target belongs to a finite set
of discrete categories. In regression, the output target is continuous. In this dissertation
and many network/multicore applications, the task is classification as we desire a discrete
output such as “Fault” or “No Fault”.
ML has been applied in network and multicore systems for purposes other than fault
prediction, such as optimizing wireless sensor topology [66], detecting false
memory sharing [67], managing network power [68], branch prediction [69, 70],
and predicting network hotspots [71]. Different ML algorithms are implemented in these
designs, such as decision trees (DTs) and artificial neural networks (ANNs). ANNs are
based on biological neural networks. Nodes are interconnected by weighted connections
which can be varied through the learning process. ANNs often use online learning to
achieve high accuracy (although ANNs can also use offline learning); however, learning
during runtime results in more overhead and possible delays. Also, the implementation cost
of ANNs in hardware can be high due to the complexity of multiplication. DTs can also use
offline learning and have a small implementation overhead. A DT is a flowchart-like model
which consists of internal nodes, branches, and leaf nodes. An internal node performs a test
on an attribute, also called a feature (e.g. whether or not the chip temperature is high). A
branch represents an outcome of the test, also called a feature value (e.g. low temperature).
Finally, the leaf node at the end of the tree represents a class label (e.g. No Fault). The
exact structure of a decision tree is determined in the learning, or training, process, which
can be done offline using the ID3 algorithm [72]. After the decision tree is built, it can be
implemented on the chip and used in online testing of new cases, or samples. Testing in
DTs simply involves testing features of the sample at each internal node and traversing the
branches until a leaf, or decision, is reached. The model we use for predicting traffic and
predicting faults is a decision tree due to its simplicity during the testing phase.
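The testing phase of a DT described above can be sketched in a few lines. The tree below is a hand-written illustration, not a tree trained in this dissertation; the `temperature` feature follows the example in the text, while the `utilization` feature and all feature values are assumptions.

```python
def classify(node, sample):
    """Walk from the root to a leaf, testing one feature per internal node."""
    while isinstance(node, dict):             # internal node
        value = sample[node["feature"]]       # test the feature on this sample
        node = node["branches"][value]        # follow the matching branch
    return node                               # leaf: the class label

# Hypothetical two-level tree (internal nodes are dicts, leaves are labels).
tree = {
    "feature": "temperature",
    "branches": {
        "low": "No Fault",
        "high": {
            "feature": "utilization",
            "branches": {"low": "No Fault", "high": "Fault"},
        },
    },
}

print(classify(tree, {"temperature": "high", "utilization": "high"}))
```

Each prediction costs at most one comparison per tree level, which is consistent with the low testing overhead noted above.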
The DT model, as well as most ML models, requires a data set for training and testing.
A data set consists of a list of samples. Each sample must contain feature values and a class
label. Some of the features will be used as the internal nodes of the DT but not necessarily
all the features. Feature selection methods can be used to narrow down the number of
features. In this dissertation, we modify the ID3 algorithm so that feature selection is
performed while training the decision tree. The class label indicates the true classification
of the sample (e.g. No Fault). The data set is typically separated into a training set and a
testing set. The model uses the training set to perform the learning process (e.g. building the
tree). Without a suitable training set, learning must be done online which can greatly add to
the overhead of the predictions. After a model is trained on the training set, predictions are
made against the test set to determine the performance of the model. It is important that the
data sets accurately represent characteristics of the real testing environment. Therefore, the
design of data sets, specifically for fault prediction, is another important problem addressed
in this dissertation.
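As an illustration of how training chooses internal nodes, the sketch below computes the information gain criterion used by the standard ID3 algorithm. It shows textbook ID3 only, not the modified version with feature selection developed in this dissertation, and the four-sample data set and its feature names are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """Entropy reduction obtained by splitting the data set on one feature."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(samples, labels, features):
    """Standard ID3 picks the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(samples, labels, f))

# Hypothetical training set: each sample has feature values and a class label.
samples = [
    {"temperature": "high", "utilization": "high"},
    {"temperature": "high", "utilization": "low"},
    {"temperature": "low", "utilization": "high"},
    {"temperature": "low", "utilization": "low"},
]
labels = ["Fault", "Fault", "No Fault", "No Fault"]
print(best_feature(samples, labels, ["temperature", "utilization"]))
```

In this toy data set the label depends only on `temperature`, so that feature would become the root and `utilization` would not be used in the tree at all.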
1.6 Major Contributions
The overall goal of this dissertation is to provide a comprehensive, proactive fault-
tolerant technique to overcome both transient and permanent faults while maintaining
low power and performance overheads. This dissertation proposes to (1) overcome faults
by providing redundant paths and reversible links, (2) predict when to reverse links for
performance enhancements, and (3) proactively mitigate errors by using prediction to
localize the occurrence of transient faults. The idea of this dissertation is to rethink how
errors are tackled in NoCs. Instead of traditional reactive techniques that both increase
the response time and delay applications when faults manifest, this dissertation
proposes to proactively mitigate faults before they can cause errors and affect the system
performance. Most previous research in fault tolerance, such as error correction codes
(ECC), takes a reactive approach to handling faults. ECC adds overhead to the data so that
errors can be detected and then corrected to reconstruct the original, error-free data. By
using prediction, errors can be avoided before they occur and affect the performance of the
system. The three major contributions are:
• Contribution 1: Overcome faults by providing redundant paths and reversible links.
• Contribution 2: Predict when to reverse links for performance enhancements.
• Contribution 3: Proactively mitigate errors by using prediction to localize the
occurrence of transient faults.
1.7 Organization of Dissertation
The rest of this dissertation is organized as follows: Chapter 2 describes the fault
tolerant QORE architecture (contribution 1). QORE uses multi-function channel buffers
to overcome permanent faults by providing redundant paths with reversible links. In
Section 2.5 an enhancement to QORE is described which improves the reversible links
using machine learning techniques (contribution 2). Next, in Chapter 3, a full design
to predict and mitigate soft errors is described (contribution 3). Section 3.1 details the
design of the data sets used for prediction, Section 3.2 describes the machine learning
algorithm implemented to predict faults, and Section 3.3 describes how faults are mitigated
in the design. Then, in Chapter 4 the network performance, power, area, and additional
overheads are evaluated for both the QORE architecture (Section 4.1) and the prediction
and mitigation of soft errors (Section 4.2). Finally, the dissertation is concluded in Chapter
5 and future work is discussed.
2 QORE: A Fault Tolerant NoC Architecture
In this chapter, we overcome permanent faults while improving power and performance
in the QORE architecture. QORE is a fault tolerant network-on-chip architecture with
power-efficient Multi-Function Channel (MFC) buffers. QORE can lower power with
channel buffers, improve performance through reversible links, and improve fault tolerance
through redundant links. The key components in QORE are the MFCs (explained in
Section 2.1) which have multiple functionalities: on-demand data storage, on-demand
forward data propagation, and backward data propagation. On-demand data storage
enables communication channels to act as buffers and store data when the network load
is high and function as repeaters when the network load is low. Therefore, MFCs have four
possible states: forward propagation, backward propagation, forward buffer, and backward
buffer. Using multiple links with the MFCs and the associated control blocks, QORE
can improve both performance and fault tolerance. The multiple links between routers
can provide data with redundant paths in case of faults. Increasing the number of links
between routers can improve fault tolerance; however, the bandwidth of the links decreases.
Therefore, there is a trade-off between fault tolerance and performance as the number of
links varies. We evaluate this trade-off by varying the number of links from two to eight in
Section 4.1. Additionally, we use machine learning techniques (Section 2.5) to accurately
determine how to reverse the various links in QORE. The QORE architecture attempts to
address three issues of power, performance and fault-tolerance in a cohesive manner [2, 3].
2.1 Multi-Function Channel Buffers
In this section, we will explain the circuit and implementation details of the MFC
buffers [2, 3]. Channel buffers have been shown to eliminate router buffer power by moving
storage to the channels with the side benefit of reducing the area overhead with marginal
performance penalty [37, 38]. In this work, we uniquely modify the previously proposed
channel buffers to function as bidirectional channel buffers with similar advantages of
reduced power while providing on-demand storage. Figure 2.1(a) shows two physical
channels with four channel buffer stages per channel. The inset shows a conventional
channel buffer which uses four transistors and a release (rel) control line to store or
propagate packets in one direction [2, 3]. The working of channel buffers to either store or
propagate packets based on router congestion and receive signals via a control block has
been discussed previously [37]. Our reversible channel buffer circuit is shown in Figure
2.1(b). By adding eight transistors to act as four transmission gates, the channel buffers
can propagate packets in both directions in addition to storage. The four transmission gates
are controlled by the reverse signal (rev) sent from the router. A table showing all possible
functions of the reversible channel buffer based on the inputs rel and rev is also shown
in Figure 2.1(b). Figure 2.1(c) shows various combinations of reversible channel buffer
functionalities; either as on-demand storage or repeater, and with data propagating either
in forward or backward directions [2, 3].
Figure 2.1: (a) Conventional channel buffer, (b) our reversible channel buffer, and (c)
storage and propagation for both forward and backward links [2] ©2014 IEEE [3] ©2015
IEEE.
[Circuit diagrams. The function table in panel (b) reads:
rel | rev | Function
0 | 0 | Forward
0 | 1 | Backward
1 | 0 | Store
1 | 1 | Store
Panel (c) shows Forward Propagation (rel=0, rev=0), Backward Propagation (rel=0, rev=1),
Forward Buffer (rel=1, rev=0), and Backward Buffer (rel=1, rev=1).]
• Forward Buffer: When rel=1 and rev=0 data can be stored in the forward direction
(left to right). The data is cut off from Vdd and GND and the data is stored on the
capacitance of the transistors [2, 3].
• Backward Buffer: When rel=1 and rev=1 data can be stored in the backward
direction (right to left). Again, the data is cut off from Vdd and GND and the data is
stored on the capacitance of the transistors [2, 3].
• Forward Propagation: When rel=0 and rev=0 data can propagate forward. The
transistors connected to Vdd and GND are enabled to allow propagation and the
forward propagation transmission gates are also enabled [2, 3].
• Backward Propagation: When rel=0 and rev=1 data can propagate backward.
Again, the transistors connected to Vdd and GND are enabled to allow propagation
and now the backward propagation transmission gates are enabled [2, 3].
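The four functions listed above follow directly from the two control inputs, as the short sketch below shows. The function name `mfc_function` is a hypothetical label introduced here for illustration.

```python
def mfc_function(rel, rev):
    """Decode the rel/rev control inputs into the channel buffer function."""
    direction = "Backward" if rev else "Forward"   # rev selects the direction
    mode = "Buffer" if rel else "Propagation"      # rel=1 stores, rel=0 propagates
    return f"{direction} {mode}"

# Print the function for every combination of the two inputs.
for rel in (0, 1):
    for rev in (0, 1):
        print(f"rel={rel} rev={rev} -> {mfc_function(rel, rev)}")
```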
Figure 2.2 shows the state diagram for the MFC control block. The MFCs switch
between buffering and propagation (forward or backward) when the release control signal
(rel) is set or cleared. For example, if the MFC starts in forward propagation then
congestion downstream can cause the rel signal to be set to 1. When rel=1 the MFC
will change from forward propagation to forward buffer. The MFC will stay in forward
buffering while rev=0 and rel=1. When the fault controller (detailed in Section 2.4)
determines that a link should be reversed due to traffic patterns, the reverse signal (rev)
is set. Since the link is changing direction, the data that was already on the link must be
flushed out so that other data can be sent in the reverse direction. Therefore, when the
reverse signal changes from 0 to 1 or from 1 to 0, the MFC must change to the flush state
regardless of the previous state. In the flush state, the release signal is set so that all of the
data on the links is flushed into the router. Once the link is empty (empty=1), the MFC will
leave the flush state and proceed to any of the four other states depending on the rev and rel
signals.
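The behavior of the control block described above can be sketched as a small state machine. This is a behavioral illustration under stated assumptions, not the hardware implementation: the class name, the one-call-per-cycle `step` interface, and the initial state are all hypothetical.

```python
class MFCControl:
    """Behavioral sketch of the MFC control block (one step() call per cycle)."""

    def __init__(self):
        self.rev = 0                          # last observed reverse signal
        self.state = "Forward Propagation"    # assumed initial state

    @staticmethod
    def _decode(rel, rev):
        direction = "Backward" if rev else "Forward"
        return f"{direction} {'Buffer' if rel else 'Propagation'}"

    def step(self, rel, rev, empty):
        if rev != self.rev:
            # Any change of rev forces a flush, whatever the previous state.
            self.rev = rev
            self.state = "Flush"
        elif self.state == "Flush" and not empty:
            pass                              # keep flushing until the link is empty
        else:
            # Normal operation, or leaving Flush once empty=1.
            self.state = self._decode(rel, rev)
        return self.state

ctrl = MFCControl()
for rel, rev, empty in [(1, 0, 1), (1, 1, 0), (1, 1, 0), (0, 1, 1)]:
    print(ctrl.step(rel, rev, empty))
```

The input sequence walks the controller from forward buffering, through a two-cycle flush triggered by the reversal, into backward propagation.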
Figure 2.2: State diagram for MFC control block.
[State diagram: Forward Buffer (rel=1, rev=0), Backward Buffer (rel=1, rev=1), Forward
Propagation (rel=0, rev=0), Backward Propagation (rel=0, rev=1), and Flush. Changes in rel
move between buffering and propagation; a change in rev enters Flush, which is held while
empty=0 and exited once empty=1.]
We show four functions of our MFC for high network loads (forward and backward
buffers) and for low network loads (forward and backward propagation) [2, 3]. When our
MFCs act as buffers, the capacitance of the transistors must be large enough to store the data
for many cycles. Figure 2.3 shows the discharge time of a channel buffer implemented with
130 nm transistors using the Virtuoso Analog Design Environment from the Cadence tools.
As shown, the discharge time of the channel buffers is on the order of milliseconds,
which corresponds to millions of clock cycles with a 1 GHz clock [2, 3].
Figure 2.3: Discharge time of channel buffer using 130 nm transistors in the Virtuoso
Analog Design Environment from the Cadence tools [2] ©2014.
[Plot: voltage (0-1.4 V) versus time (0-2 ms) for the ctrl, in, and out signals.]
The rate of capacitance discharge is governed by the rate equation $dV/dt = I_{off}/C$,
where $dV/dt$ is the rate of discharge, $I_{off}$ is the leakage current, and $C$ is the
capacitance. The leakage current will increase with technology scaling as transistors become
smaller. The capacitance is given by $C = \varepsilon (W \times L)/t$, where $\varepsilon$
is the dielectric constant, $W$ is the width, $L$ is the length, and $t$ is the thickness.
As technology scales (depending on the design, one could consider DRAM node scaling from
the ITRS roadmap), $C$ will scale less than linearly. Further, the clock rate is increasing
only minimally to compensate for this increase. Therefore, the term $dV/dt$ will gradually
(nominally) increase with technology scaling. However, this is an important factor not only
for the QORE architecture but for all ASIC designers. This discharge rate is similar to what
we would expect even in our repeaters or DRAM cells. Therefore, chip designers will ensure
that the capacitance scales linearly so that the discharge rate does not dramatically
increase with technology scaling and data can be stored for well over 100 cycles.
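To make the rate equation concrete, the sketch below estimates how long a stored value survives for a given leakage current and capacitance. All numeric values are assumptions chosen only to land on the millisecond scale discussed above; they are not measurements from this work.

```python
def hold_time_seconds(capacitance_f, leakage_a, allowed_droop_v):
    """Time for the stored voltage to droop by allowed_droop_v.

    Follows from dV/dt = I_off / C, assuming a constant leakage current.
    """
    return capacitance_f * allowed_droop_v / leakage_a

# Hypothetical values: 2 fF node, 1 pA leakage, 0.5 V tolerable droop.
C = 2e-15          # farads (assumed)
I_OFF = 1e-12      # amperes (assumed)
DROOP = 0.5        # volts (assumed)
CLOCK_HZ = 1e9     # 1 GHz clock, as in the text

t_hold = hold_time_seconds(C, I_OFF, DROOP)
print(f"hold time: {t_hold * 1e3:.2f} ms, about {t_hold * CLOCK_HZ:.0f} cycles")
```

With these assumed values the hold time is about a millisecond, that is, about a million cycles at 1 GHz; doubling the leakage or halving the capacitance halves it.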
2.2 MFC without Faults
Conventional routers that use virtual channels (VCs) and fixed connections between
routers can become a bottleneck if there is high traffic in any direction. To reduce the
buffering bottleneck, QORE uses our reversible channel buffers to dynamically allocate
buffers to adapt to traffic patterns. Figure 2.4 shows the links between routers in QORE. In
order to have the same amount of buffering as a conventional 4 VC/input router, we place
a set of N=4 links between routers each consisting of two channel buffer lines. Each link
consists of two channel buffer lines to alleviate HoL blocking [2, 3, 41]. Additionally, since
QORE has more links between routers than the two links in conventional routers, we have
reduced the bandwidth of our links for a fair comparison, as explained in the evaluation
section. Therefore, the wire area overhead of QORE is equal to that of the conventional
baseline networks. However, a designer can choose N to be a different number depending on system
requirements. Each router link is reversible, allowing communication in both directions.
However, the two channel buffer lines in each link will always be directed the same way.
This will ensure that at any time, a packet will have at least two VCs to choose from,
which in turn will alleviate HoL blocking. In QORE, when there is high traffic in one
direction, the links can change direction according to the traffic load, thereby increasing
buffer space. For example, in Figure 2.4, when there is high eastbound traffic, three links
(a-c) can be allocated to the east direction while one link (d) remains in the west direction.
The three east links can, therefore, use the under-utilized westbound buffers and provide
more buffering for eastbound traffic. This additional buffering will relieve congestion at
router 2 as well as router 1 and other upstream routers. Meanwhile, the one west link can
still provide buffering for westbound traffic. As a result, both eastbound and westbound
traffic can have ample buffering, thereby, decreasing packet latency. Therefore, reversing
router links in QORE can reduce traffic bottlenecks caused by under-utilized links and
buffers [2, 3].
Figure 2.4: QORE’s four reversible router links each consisting of two channel buffer lines
[2] ©2014 IEEE [3] ©2015 IEEE.
[Diagram: Router 1 and Router 2, each with a crossbar, connected by N=4 router links (a-d);
each link has two channel buffer lines with four stages each. The east direction has more
buffers and the west direction has low congestion.]
Determining which direction to allocate links is critical in QORE. Network traffic is
measured using hardware counters to store the number of link traversals in each direction.
A two-stage controller, which is detailed further in Section 2.4, is used to allocate links to
the appropriate direction based on traffic demands. The first stage of the controller, the
link controller (LC), uses the counters to determine which direction has the highest traffic,
called the “majority”. The second stage, the fault controller (FC), will assign all but one
link to the majority direction or allocate equal links to both directions if the link
utilizations are similar. In the example in Figure 2.4, each time a flit traverses links (a-d),
both routers 1 and 2 will increment their counters. Since there is high eastbound traffic in
this example, the link controllers will determine that the majority of the traffic is moving
from west to east. At this point, the fault controllers in both routers 1 and 2 will allocate
the first three links (a-c) to the east and allocate link (d) to the west. If there are packets
currently stored in the channel buffers when the reversing occurs, then these packets will be
flushed out to escape VCs inside the downstream router [2, 3].
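The two-stage allocation described above can be sketched as follows. This is an illustration rather than the controller hardware: the function names are hypothetical, and the 25% `similarity` threshold used to decide when utilizations are "similar" is an assumption, since the text does not give one.

```python
def majority_direction(east_flits, west_flits, similarity=0.25):
    """LC stage: return the majority direction, or None if traffic is similar.

    The similarity threshold is an assumed parameter of this sketch.
    """
    total = east_flits + west_flits
    if total == 0 or abs(east_flits - west_flits) <= similarity * total:
        return None
    return "east" if east_flits > west_flits else "west"

def allocate_links(majority, num_links=4):
    """FC stage: all but one link to the majority, or an equal split."""
    if majority is None:
        east = num_links // 2
    elif majority == "east":
        east = num_links - 1
    else:
        east = 1
    return {"east": east, "west": num_links - east}

# High eastbound traffic, as in the Figure 2.4 example.
print(allocate_links(majority_direction(east_flits=900, west_flits=100)))
```

Keeping one link in the minority direction preserves connectivity both ways while most buffering follows the heavier traffic.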
2.3 MFC with Faults
QORE uses MFC buffers to overcome hard faults in the network. When a link in one
direction is faulty, another link can reverse its direction to overcome this fault. Figure 2.5
shows the overall layout of the QORE network for 16 routers, which can be easily scaled
to larger numbers. The routers are connected to each other in a grid-like fashion similar
to a mesh network. However, instead of the two unidirectional links between routers as
in a mesh, QORE has four narrower reversible links between each pair of routers. Again,
each reversible link consists of two channel buffer lines. Also, the links are narrower than
the baseline links, as explained in the Evaluation section, so there is no area overhead. The
additional links create redundant paths between routers to improve both performance and
reliability while avoiding HoL blocking. The link setup shown in Figure 2.5 is arbitrary;
each link can reverse in either direction depending on traffic demands [2, 3]. QORE also
has a backup ring network [73] which is used when there are a large number of faults that
potentially could isolate healthy routers. Each router has a link controller (LC) and a fault
controller (FC) (detailed in Section 2.4) that analyze link utilization and determine which
links to reverse [2, 3].
Figure 2.5: Layout of QORE showing links configured to an arbitrary traffic pattern [2]
©2014 IEEE [3] ©2015 IEEE.
[Diagram: 4x4 grid of routers (0-15) with link controllers (LC) and fault controllers (FC),
reversible links (link labels a-d; router sides +x, -x, +y, -y), and a backup ring.]
Each set of four links can handle up to three faulty links before using the backup ring.
If a fault is detected in any of the links of a set, then the remaining non-faulty links will
point in the directions specified by the LC and FC. For example, suppose the four links on
the +x side of router 0 are initially setup as shown in Figure 2.5 with two links facing east
(E) and two links facing west (W). If faults are detected in both links a and b, then links c
and d can overcome these faults by setting their directions to E and W, respectively. This
will maintain connectivity between routers 0 and 1 so that packets can still be transmitted
to both sides. If three of the four links fail then the fourth link can be used to communicate
both ways since it is reversible. However, if all four links between two routers fail, then the
backup ring network must be used. The backup ring network consists of two unidirectional
rings, so that packets can traverse the shortest path, either clockwise or counterclockwise,
to their destination. For example, if all four +x links of router 0 fail and the destination is
router 5 then the packet will be routed on the ring network from router 0 to router 1, and
so on up to router 5. Once a packet is on the ring network, it must stay on the ring network
until it reaches its destination in order to avoid livelocks and deadlocks [2, 3].
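The fault-handling cases above can be summarized in a short sketch. The function names and the returned strings are hypothetical, and `ring_direction` assumes that routers are identified by their position along the ring, which the text does not specify.

```python
def link_plan(faulty):
    """Decide how a set of links between two routers is used, given fault flags."""
    good = [i for i, bad in enumerate(faulty) if not bad]
    if not good:
        return "use backup ring"                    # every link in the set failed
    if len(good) == 1:
        return "share one reversible link"          # reverse it as needed
    return "split good links between directions"    # directions set by LC and FC

def ring_direction(src_pos, dst_pos, ring_size):
    """Pick the shorter way around a bidirectional ring (positions are assumed)."""
    clockwise = (dst_pos - src_pos) % ring_size
    counterclockwise = (src_pos - dst_pos) % ring_size
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)

print(link_plan([True, True, False, False]))   # links a and b faulty
print(link_plan([True, True, True, True]))     # all four links faulty
print(ring_direction(0, 5, 16))
```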
2.4 Link and Fault Controllers
In order to keep track of the status of each link, Link Status Tables (LSTs) are
implemented in hardware. There are four LSTs per router in QORE, one for each set
of links. The naming convention is shown in the top portion of Figure 2.6. The links
on the right side of a router are labeled as the +x links, links on the left are labeled -x links,
etc. Each set of links has an LST containing information about the links. Each table has as
many entries as links in each direction. In this dissertation, there are always four links in
each direction. Hence, n+x = n−x = n+y = n−y = 4 and each table has four entries [2, 3].
Each link in a specified direction has a unique identifier stored in the Link Address
field. Whether the link is facing in towards the router or out away from the router is
specified in the Direction field. This field will be read by the routing computation (RC)
to determine valid routing paths and will be set up by the algorithm in the FC. The Flit
Count data field stores the number of flit traversals on the link within the reconfiguration
window, Rw. These counters are read by the LC to determine traffic demands. Each counter
[Figure 2.6 (diagram): router with link sets +xLnk, -xLnk, +yLnk, and -yLnk, and the +x and -y Link Status Tables. Each LST has one row per link with the fields Link Address (log2(n+x) bits), Direction (1 bit, In/Out), Flit Count (log2(Rw) bits), and Faulty (1 bit, Yes/No), plus one Total Good Links field (log2(n+x) bits, values 0 to n+x) per table.]
Figure 2.6: Link direction naming convention and link status tables which are located in
the router and used to store statistics on each local link [2] ©2014.
is incremented every time its corresponding link receives a flit and is decremented every
time its corresponding link sends a flit. The Faulty data field stores whether or not the link
is usable. This data field is read by the FC and RC. The field is set when its corresponding
link detects a fault [2, 3]. Detection of faults can be done by implementing BIST (Built-In
Self-Test) [53, 74]; however, fault detection is beyond the scope of this work. Finally,
each table stores the total number of working links which is set each time a fault is detected
[2, 3].
The block diagrams for the LC and FC are shown in Figure 2.7(a). The LC and FC are
split into four independent blocks corresponding to each direction (+x, -x, etc.). The inputs
of the LCs are the direction fields for each of the four router links. The output of the LCs
indicates which direction (N=north, E=east, S=south, W=west, or B=both) the majority of
[Figure 2.7 (diagram): (a) the +x and -y LC blocks take the flit counts of links 0 to 3 and an Enable signal (end of Rw) and output the majority (E, W, or B for +x; N, S, or B for -y); the FC blocks take the majority, the number of good links, and the faulty status of links 0 to 3, and output a Link Address and a Direction. (b) Link assignments for no faults, one fault, two faults, and three faults.]
Figure 2.7: (a) Block diagrams of link controller (LC) and fault controller (FC) and (b)
example of fault adaptability for a various number of faults [2] ©2014.
the flits were traveling during the last Rw cycles. If the traffic was roughly equal (within ∆,
where ∆=5% of total flit traversals in this paper), then a B is output and an equal number of
links will face in each direction. The LC output gives a good measure of the traffic demand
so that link bandwidth can be properly allocated. The simple algorithm to determine the
majority of the +x (px) links is shown in Algorithm 1. We will show how to improve this
algorithm using machine learning in Section 2.5. At the end of Rw, the LCs total up the
counts from their corresponding LSTs. Since the counters are incremented when a flit is
received and decremented when a flit is sent, a positive total indicates that the majority of the
traffic is moving "West" for the set of +x links and a negative total indicates more
"East" traffic. Finally, the counts in the LSTs are cleared for the next Rw. The majority
output is then fed to the FCs [2, 3].
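Algorithm 1 Link Controller Pseudocode for +x (px) Links [2] ©2014 IEEE [3] ©2015
IEEE.
if(Enable){                          // asserted at the end of Rw
    total_count = 0;
    for(i = 0 to n+x − 1)            // sum the flit counts of all +x links
        total_count = total_count + pxLnk[i].count;
    if(total_count > 0)              // more flits received than sent
        pxMajority = West;
    else if(total_count < 0)         // more flits sent than received
        pxMajority = East;
    else
        pxMajority = Both;
    clear_all_counts();
}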
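The link controller decision, including the ∆ band described in the text, can be sketched in Python as follows. This is an illustrative sketch rather than the hardware implementation: the function name `link_controller`, the separate `total_traversals` input, and the reading of "within ∆" as a net count whose magnitude is at most 5% of the total flit traversals are all assumptions.

```python
def link_controller(counts, total_traversals, delta_fraction=0.05):
    """Return 'W', 'E', or 'B' for one set of +x links.

    counts: signed per-link flit counters from the LST
            (+1 per flit received, -1 per flit sent).
    total_traversals: total flits that crossed the link set during Rw
            (hypothetical extra input; the signed counters alone do not
            carry this information).
    delta_fraction: the Delta band, 5% of total traversals in the text.
    """
    net = sum(counts)
    threshold = delta_fraction * total_traversals
    if net > threshold:
        return "W"   # mostly received flits: traffic is moving West
    if net < -threshold:
        return "E"   # mostly sent flits: traffic is moving East
    return "B"       # roughly balanced traffic


# 15 flits received and 5 sent on the four links during the window.
print(link_controller([10, 5, -3, -2], total_traversals=20))
```

Setting `delta_fraction` to 0 reduces the sketch to the comparison against zero used in Algorithm 1.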
The inputs for the FC, shown in Figure 2.7(a), are the majority signal, the total number
of good links, and the fault status of each link. The FC determines the new directions for
each link by outputting their link address and updating the direction field in the LST. The
algorithm to determine the directions for the +x set of links is shown in Algorithm 2. If
the LC determines that the majority is E, then the FC will assign a majority of the links
to the E direction as shown in Figure 2.7(b). The FC also tries to maintain connectivity
by assigning at least one link to the opposite direction of the majority when possible in
the assign_one_link function. The remaining links are assigned to the majority direction
in the assign_remaining_links function. Both the assign_one_link and assign_remaining_links
functions are optimized so that the number of links that change directions is minimized.
Therefore, with four links between routers, at most two links will change direction after the FC runs.
This minimizes the number of flits that must be flushed. When there is only one non-faulty
link, then the FC must break connectivity and assign the link to the majority direction.
Algorithm 2 Fault Controller Pseudocode for +x (px) Links [2] ©2014 IEEE [3] ©2015
IEEE.
if(Majority of traffic is West){
    if(pxLnk.totalGood == 1)
        assign_one_link(West);
    else{
        assign_one_link(East);
        assign_remaining_links(West);
    }
} else if(Majority of traffic is East){
    // Same as above except interchange West and East
} else if(Traffic is similar in both directions){
    assign_half_links(West);
    assign_half_links(East);
}
However, this will cause starvation as packets cannot be sent in one direction. We resolve
this by allocating 60% of Rw to the majority direction and reserve 40% of Rw to the opposite
direction. We chose 60% because our simulation results showed that this value gave the
best average performance over all the benchmarks [2, 3].
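The fault controller behavior described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the hardware algorithm: the function names `assign_directions` and `single_link_direction`, the list-based representation of link state, the way direction changes are minimized (links keep their current direction while a quota for that direction remains), and the handling of an odd number of good links under a Both majority are all hypothetical.

```python
def assign_directions(majority, faulty, current):
    """Assign 'W'/'E' to each non-faulty link; faulty links get None.

    majority: 'W', 'E', or 'B' from the link controller.
    faulty:   list of booleans, one per link.
    current:  list of current directions ('W' or 'E'), one per link.
    """
    good = [i for i, bad in enumerate(faulty) if not bad]
    new = [None] * len(faulty)
    n = len(good)
    if n == 0:
        return new                      # backup ring must be used
    if majority == "B":
        quota = {"W": n // 2, "E": n - n // 2}   # odd split is arbitrary
    else:
        opposite = "E" if majority == "W" else "W"
        minority_links = 1 if n > 1 else 0       # keep connectivity if possible
        quota = {majority: n - minority_links, opposite: minority_links}
    pending = []
    for i in good:                      # keep directions that still fit the quota
        if quota[current[i]] > 0:
            new[i] = current[i]
            quota[current[i]] -= 1
        else:
            pending.append(i)
    for i in pending:                   # reverse only the links that must change
        direction = "W" if quota["W"] > 0 else "E"
        new[i] = direction
        quota[direction] -= 1
    return new


def single_link_direction(majority, cycle, rw, majority_percent=60):
    """Direction of the only good link at a given cycle within the window Rw."""
    opposite = "E" if majority == "W" else "W"
    return majority if cycle * 100 < majority_percent * rw else opposite


print(assign_directions("W", [False] * 4, ["E", "E", "W", "W"]))
print(assign_directions("B", [True, True, False, False], ["E", "W", "E", "E"]))
print(single_link_direction("E", cycle=700, rw=1000))
```

In the first call only one link reverses, and in the second call the two surviving links end up facing East and West, matching the two-fault example for links c and d in Section 2.3.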
2.5 Improving Link Controllers with ML
In order to improve our link controllers, we propose to use ML techniques to predict
the traffic flow on the links. Our baseline link controllers only use link utilization from the
previous time window to predict the traffic for the next time window. For example, if the
majority of traffic was going east during the last reconfiguration window then the links are
adjusted so that more bandwidth is given to the east direction for the next reconfiguration
[Figure 2.8 (diagram): a decision tree whose internal nodes are labeled Feature 1 to Feature 10 and whose leaves are W (WEST), E (EAST), or B (BOTH).]
Figure 2.8: An example decision tree [3] ©2015 IEEE.
window. However, with dynamic traffic patterns the direction of the traffic may drastically
change from one reconfiguration window to the next. If the links are incorrectly reversed
for the current traffic pattern then a performance penalty will occur. Therefore, we use ML
techniques to improve the accuracy of our LCs [3].
We use decision trees for predicting the direction of traffic in QORE due to their
simplicity during the testing phase. Testing in decision trees is simple because the
algorithm uses a few comparisons instead of more complicated operations such as
multiplication or addition. Training for the machine learning algorithm can be done offline
so that it does not affect the performance of the applications. An example decision tree
is shown in Figure 2.8. Each node in the tree is an input, or feature, used to predict the
target output. The target output of a decision tree is a discrete class. In the case of x links,
our three output classes are EAST, WEST, or BOTH and for y links the output classes are
NORTH, SOUTH, or BOTH, indicating where the majority of the traffic is moving [3].
In order to train the decision tree, a set of input features must be engineered. Figure
2.9 shows a list of possible features that can be used to detect the traffic on the +x links of
router 0 (links a, b, c, and d in Figure 2.5). Note that these links can equivalently be labeled
as the -x links of router 1. The features include various link and buffer utilizations from
router 0 and surrounding routers. Initially, the features were selected based on intuition of
R0 +x Features 
• Link difference = (Pkts sent on +x) – (Pkts received from +x) 
• Request difference = (Requests sent on +x) – (Requests received from +x) 
• Response difference = (Responses sent on +x) – (Responses received from +x) 
• R0 core buffer utilization 
• R1 core buffer utilization 
• Core buffer difference = (R0 core buffer utilization) – (R1 core buffer utilization) 
• R0 +y buffer utilization 
• R1 +y buffer utilization 
• R1 +x buffer utilization 
• Buffer difference = (R1 –x buffer utilization) – (R0 +x buffer utilization) 
• Other direction difference =  
        (Pkts sent on other directions for R0) – (Pkts sent on other directions for R1) 
• R0 +y link difference 
• R1 +y link difference 
• R1 +x link difference 
• R4 +y link difference 
• R0 core pkt difference =  
        (Pkts sent by R0 cores on +x) – (Pkts received by R0 cores from +x) 
• R1 core pkt difference =  
        (Pkts sent by R1 cores on -x) – (Pkts received by R1 cores from –x) 
Figure 2.9: List of possible features for predicting traffic on the +x links of router 0 [3]
©2015 IEEE.
good predictors. For example, packets using the -x links of router 2 can possibly use the
-x links of router 1 in the future. The algorithm used to build the tree will refine our list
of features and use only the features which are most useful in predicting traffic direction.
Every Rw cycles the features will be collected, stored in expanded LSTs, and used to predict
the outcome of the next Rw cycles. Each LC will be implemented with a decision tree but
the features will vary slightly depending on the location of the links in the network [3].
We use the ID3 algorithm [72] to train the decision trees. The ID3 algorithm, shown in
Algorithm 3, uses a set of training data, D, and a set of features, F. At each node in the tree,
the algorithm finds one feature, X, which has the largest information gain. The training data
D is then partitioned so that all examples that have the same value Xi for feature X are put
into a new dataset Di. The ID3 algorithm is then recursively called on the new data set and
on a new set of features that is without feature X. The algorithm recursively builds the rest
of the tree with the terminating condition being all examples in D have the same label. The
information gain for feature X is defined in the following equation:
\[ IG(X; D) = H(T; D) - \sum_{i=1}^{k} \frac{|D_i|}{|D|} H(T; D_i) \tag{2.1} \]
where T is the random variable corresponding to the outcome, k is the number of feature
values, and H is the entropy function. The entropy function is given by:
\[ H(T; D) = -\sum_{i} p(T_i) \log(p(T_i)) \tag{2.2} \]
where p(Ti) is the probability of label Ti in data set D.
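Equations 2.1 and 2.2 can be evaluated with a few lines of Python. The sketch below is illustrative only: it assumes a base-2 logarithm (the base is not stated in the text), represents each sample as a dictionary of discrete feature values, and uses hypothetical function names `entropy` and `information_gain`.

```python
import math
from collections import Counter


def entropy(labels):
    """H(T; D): entropy of the label distribution in a data set."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(samples, labels, feature):
    """IG(X; D): entropy of D minus the weighted entropy of its partitions."""
    n = len(labels)
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[feature], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Toy data: "link_diff" separates the labels perfectly, "buffer" does not.
samples = [
    {"link_diff": "pos", "buffer": "low"},
    {"link_diff": "pos", "buffer": "high"},
    {"link_diff": "neg", "buffer": "low"},
    {"link_diff": "neg", "buffer": "high"},
]
labels = ["EAST", "EAST", "WEST", "WEST"]
print(information_gain(samples, labels, "link_diff"))
print(information_gain(samples, labels, "buffer"))
```

A feature that separates the labels perfectly recovers the full entropy of the data set, whereas a feature that leaves each partition as mixed as the original has zero gain.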
In our DTs, we modify the terminating condition of the ID3 algorithm to limit the size
of the tree to three levels so that the number of comparisons during testing is at most
three. Therefore, after the third level in the tree, a majority vote is performed on the
remaining samples to determine the label in the tree. The implementation of the decision tree
has a minimal overhead as will be shown in Section 4.1.1 [3].
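A depth-limited ID3 procedure of this kind can be sketched in Python as shown below. The sketch assumes discrete feature values, a base-2 logarithm, and a nested-dictionary tree representation; the names `id3` and `predict`, the fallback to the majority label for unseen feature values, and the toy data are illustrative assumptions rather than details of the original implementation.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, feature):
    """rows: list of (feature_dict, label) pairs."""
    labels = [y for _, y in rows]
    parts = {}
    for x, y in rows:
        parts.setdefault(x[feature], []).append(y)
    return entropy(labels) - sum(len(p) / len(rows) * entropy(p) for p in parts.values())


def id3(rows, features, depth=0, max_depth=3):
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on pure data, on the level limit (majority vote), or when no features remain.
    if len(set(labels)) == 1 or depth == max_depth or not features:
        return majority
    best = max(features, key=lambda f: info_gain(rows, f))
    node = {"feature": best, "default": majority, "branches": {}}
    remaining = [f for f in features if f != best]
    for value in {x[best] for x, _ in rows}:
        subset = [(x, y) for x, y in rows if x[best] == value]
        node["branches"][value] = id3(subset, remaining, depth + 1, max_depth)
    return node


def predict(tree, x):
    """Walk the tree: one comparison per level, at most max_depth comparisons."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(x[tree["feature"]], tree["default"])
    return tree


rows = [
    ({"t": "low", "u": "low"}, "N"),
    ({"t": "low", "u": "high"}, "N"),
    ({"t": "high", "u": "low"}, "N"),
    ({"t": "high", "u": "high"}, "F"),
]
tree = id3(rows, ["t", "u"])
print(predict(tree, {"t": "high", "u": "high"}))
```

Because prediction only follows one branch per level, the level limit directly bounds the number of comparisons performed at run time.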
2.6 Router Architecture
Figure 2.10 shows the router microarchitecture of QORE. The four links to the left of
the router can act as outputs or inputs. When acting as an output, the data comes from
the crossbar and is demultiplexed onto the four channel buffers. As an input, the data is
multiplexed into the crossbar. After a signal is multiplexed, it is normally sent straight to
the crossbar. However, it can instead be sent to an escape buffer. The escape buffers are used to move
packets from the channel buffers when the links are reversed [2, 3]. They are also used to
avoid deadlocks [75] as explained in Section 2.7. When the escape buffers are full, the
upstream router will receive a congestion signal and will not send packets to the channel
buffers, thereby guaranteeing that the escape buffers will have enough room to flush out
the channel buffers. Four buffers are used because at most four channel buffer lines will
reverse direction. Therefore, the four escape buffers ensure that a packet will have a buffer
Algorithm 3 ID3 algorithm [72] used to build the decision tree [3] ©2015 IEEE.
ID3(Training data D, Feature set F):
    if all samples in D have the same label:
        return a leaf node with that label
    if levels in the tree equals 3:
        return a leaf node with label chosen from a majority vote
    let X∈F be the feature with the largest information gain
    let R be a tree root labeled with feature X
    let D1, D2, ..., Dk be the partition produced by splitting D on feature X
    for each Di∈D1, D2, ..., Dk:
        let Ri=ID3(Di, F-{X})
        add Ri as a new branch of R
    return R
to go to when the links reverse. Each time a flit traverses the links, the counters in the LSTs
are incremented or decremented based on the link direction. The inset in Figure 2.10 shows
the counter for link 0. When a flit traverses a link, it signals the counter, which increments
the flit count in the LST if the direction is in or decrements the flit count if the direction is
out. The LC and FC blocks access information from the LSTs as described in Section 2.4.
The route computation (RC) is modified to determine which link to send data on in
addition to which direction to send the packet. The link decision is based on which link has
the lowest count in the LST. Therefore, the traffic will be spread evenly among the links.
The switching control (SC) sends the release (rel) and reverse (rev) signals to the channel
buffers. When there is contention at the crossbar or downstream router, the SC notifies the
channel buffers to store the data by setting the rel signal to 0. The SC also reads the LSTs
to obtain the rev signal, notifying the channel buffers of the correct direction [2, 3].
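[Figure 2.10 (diagram): router with inputs from and outputs to the cores, four escape buffers, the ring connection, VCs, crossbar (Xbar), LC, FC, RC, SC, and LSTs. The inset shows the link 0 counter, which adds or subtracts 1 from the link 0 count on each flit signal according to the link 0 direction.]
Figure 2.10: Router Microarchitecture showing inputs/outputs, LC, FC, and RC [2] ©2014
IEEE [3] ©2015 IEEE.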
2.7 Deadlock Avoidance and Reliability Concerns
In conventional NoCs, the XY routing algorithm is used to avoid deadlocks by prohibiting
certain turns (Y-to-X). However, when links reverse, if not handled properly, there is a potential
for deadlock as communication in one direction can be cut off, leading to starvation. In
QORE, we avoid deadlocks by a) maintaining connectivity, thereby, eliminating starvation,
b) using escape VCs to flush out channel buffers during reversing, and c) keeping packets
on the backup ring network until their destination is reached. To prove that our network is
deadlock-free we examine the three possible states of the N links between routers [2, 3]:
Case I: Zero to N-2 links are faulty. In order to prevent deadlocks, connectivity
must remain between routers. The FC algorithm first assigns one link to the non-majority
direction then assigns the remaining links to the majority direction. This ensures that there
is always a connection in both directions. Conventional deadlock-free algorithms such as
XY routing can, therefore, be applied and deadlocks are completely avoided [2, 3].
Case II: N-1 links are faulty. Again, in order to prevent deadlocks, connectivity must
remain between routers. However, in this case only one link is available. The algorithm
of the FC will assign this one link to the majority direction. Then at 60% of Rw, FC will
change the Direction field in the LST to the opposite direction. This will cause the link to
flush out the data from the channel buffers to the escape VCs located at the downstream
router, thereby, allowing packets to be sent in the opposite direction. Therefore, 60% of Rw
will be allocated to the majority direction and 40% of Rw will be allocated for the opposite
direction, providing full connectivity [2, 3].
Case III: All N links are faulty. In this case, no channel buffers are available and protocol
states that packets must use the backup ring network to proceed. To avoid deadlocks and
livelocks, packets must remain on the backup ring network until their destination is reached.
We ensure the packet stays on the ring by adding a one bit ring field to the packet that
indicates to the RC whether or not the ring network should be used. When the ring bit is
"1", the RC will always send the packet on the ring even if router links are available. If the
ring bit is "0" then the router links must be used. To avoid circular dependencies once on
the bidirectional ring, a separate set of VCs is allocated to each direction [2, 3].
Protocol deadlocks can be avoided since each link has two buffer lines. One buffer
line can be assigned to requests while the other is used for response traffic. Other than
deadlocks and livelocks, another concern is that the fault-tolerant components
themselves, such as the backup ring network or the fault controllers, may fail. The backup
ring network adds redundancy to links between routers. Moreover, since this backup ring
network does not use reversible channel buffers, it has 10 fewer transistors at every repeater,
creating a more robust connection between routers. For the LC, FC, and LSTs, since they
have a very small overhead, as shown in Section 4.1.1, these components would be ideal
for dual modular redundancy (DMR) or triple modular redundancy (TMR) [2, 3].
3 Prediction and Mitigation of Soft Errors in NoCs
In this chapter, we propose a comprehensive fault-prediction system in which we (a)
create a methodology to obtain the training/testing sets (Section 3.1), (b) train a ML
algorithm to predict timing faults on links (Section 3.2), and (c) mitigate soft errors (Section
3.3). All ML techniques require a set of data to train and test the chosen algorithm. Since
fault prediction with ML has never before been implemented for NoCs, we have developed
data sets based on several effects such as device wear-out and process-voltage-temperature
variation. From these data sets, we train decision trees which can predict timing faults
during runtime and produce several different outcomes each time a flit uses a link: none
of the bits will be in error, few bits will be in error (1-2 bits), or several bits will be in
error (> 2 bits). Based on the outcome of our predictor, we will decide not to apply any
mitigation, apply error correction code (ECC), or apply relaxed timing transmission. The
combined impact of prediction and mitigation is to reduce the retransmission of packets
and save energy when soft errors manifest in the NoC.
3.1 Design of Data Sets
To develop a data set, we first must create a fault model which can realistically produce
a probability of error in a system based on a set of parameters. Using this fault model, we
can then create a data set consisting of a large amount of samples by varying the model’s
input values. We can use this process to create different data sets for each link in the NoC.
Figure 3.1 shows the complete methodology involved in obtaining a data set including the
fault model shown in the dotted box. Subsection 3.1.1 explains our fault model which
consists of several realistic temperature, delay, and variation models. Subsection 3.1.2
explains the methodology used to obtain our data sets.
3.1.1 Fault Model
Our fault model, shown in the dotted box of Figure 3.1, correlates link utilization to
temperature and degree of wear-out to transistor delays. The corresponding temperature
and delays are then passed to the VARIUS model [76] which incorporates process and
voltage variations to determine the probability of timing errors.
The first input parameter for our model is link utilization. A high link utilization
implies an increased number of link and router traversals which lead to increased energy
consumption and higher temperature. Therefore, the probability of fault will increase
because gate delays become longer as temperature increases. Using a NoC fault model
[77] which integrates the HotSpot thermal model [78], we correlate link utilization to
temperature. We assume link utilization that ranges from 0.01 to 0.4 flits/cycle/node which
corresponds to temperature values ranging between approximately 75-104 Celsius. In
subsection 3.1.2, we will explain how the link utilization is varied.
The next input parameter of our model is device wear-out. Devices under long-term
stress suffer from effects such as wear-out and failure due to NBTI and HCI [5]. NBTI
and HCI shift transistor parameters over time and can be modeled through transistor
threshold voltage (Vth) [79]. The shift in threshold voltage causes increased transistor
delays according to the Alpha-power law [80]:
\[ d_g \propto \frac{V_{dd}}{\mu (V_{dd} - V_{th})^{\alpha}} \tag{3.1} \]
where dg is the transition delay, µ ∝ T^(−1.5) (where T is temperature), and α = 1.3. Therefore,
as devices become more frequently used, wear-out occurs which increases the delays of
transistors and devices become more susceptible to timing errors.
We consider a device to have a permanent fault when ∆Vth is greater than 10% [79].
Therefore, we define three values for wear-out which are below 10%: low (∆Vth=0-3.3%),
medium (∆Vth=3.3-6.6%), and high (∆Vth=6.6-10%). From Equation 3.1 we determine
[Figure 3.1 (diagram): network utilizations pass through the NoC fault model (MIT) and HotSpot to give temperatures; wear-out passes through Equation 3.1 to give delays; temperatures and delays feed the VARIUS fault model (process variation is an internal parameter), which outputs the probability of error (pe). The raw data table has the columns Cycle (ti), Temp in °C (Ti), Load (Ui), Wear-out (Wi), and Error Type (ETi). The features and labels table has the feature columns Temp (Ti), Load (Ui), Wear-out (Wi), and Previous Error (ETi), and the true label ETi+1, so that the features of cycle i predict the outcome of cycle i+1.]
Figure 3.1: Our process to create features and labels from raw data collected by combining
several fault models [1].
the corresponding transistor delays for each wear-out value. Assuming Vdd = 1V and
Vth0 = 0.15V , the new delays (d) for each wear-out value based on the initial delays (d0)
are:
• Low: d = 1×d0 to 1.00592×d0
• Medium: d = 1.00592×d0 to 1.01190×d0
• High: d = 1.01190×d0 to 1.01796×d0
Based on the wear-out value, a range of delays can be calculated and the average delay can
be an input to the VARIUS model.
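The delay multipliers listed above can be checked numerically. The following Python sketch evaluates the ratio d/d0 implied by Equation 3.1 for a relative threshold-voltage shift, assuming that the temperature (and therefore µ) is the same before and after wear-out so that it cancels. The function name `delay_ratio` and its parameters are illustrative. The exponent is left as a parameter because the listed multipliers appear to correspond to an exponent of 1 with the wear-out boundaries taken as one third and two thirds of 10%, whereas α = 1.3 gives somewhat larger ratios.

```python
def delay_ratio(delta_vth, vdd=1.0, vth0=0.15, alpha=1.3):
    """Ratio d/d0 from Equation 3.1 for a relative Vth shift.

    delta_vth: relative increase of the threshold voltage (0.10 means 10%).
    Temperature is assumed unchanged, so the mobility term cancels.
    """
    vth = vth0 * (1.0 + delta_vth)
    return ((vdd - vth0) / (vdd - vth)) ** alpha


# Boundaries of the low, medium, and high wear-out ranges.
for shift in (0.10 / 3, 0.20 / 3, 0.10):
    print(round(delay_ratio(shift, alpha=1.0), 5), round(delay_ratio(shift), 5))
```

The first column printed for each boundary uses an exponent of 1 and the second uses α = 1.3, which makes the sensitivity of the delay ranges to the exponent easy to inspect.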
The final aspect of our fault model is the process variation which is integrated as
an internal parameter of the VARIUS model. Process variation has both a systematic
component which is spatially correlated and a random component which is not spatially
correlated. The systematic variation is modeled by using a function which relates parameter
correlation to distance. The relationship is negative and approximately linear. For example,
two points very close together have transistor parameters which are highly correlated and
two points very far away have very low correlation. At a distance of φ, two points are no
longer correlated. We use φ = 1 cm as experimental results show that gate lengths are
correlated up to approximately half the chip length [76, 81]. Therefore, depending on link
utilization, wear-out, and position on the chip, we can obtain a probability of error for each
link.
3.1.2 Data Sets
Next, we must create raw data by (a) inserting various link utilization and wear-out
values and (b) determining what type of error occurred. In order to obtain a wide
range of input values, we model an application, shown in Table 3.1, which ramps up
link utilization from 0.01 to 0.4 flits/cycle/node. We stop at 0.4 flits/cycle/node as most
networks saturate before this point and temperatures approach the bounds of normal
operating temperatures [77]. This model application is run three separate times for each
value of wear-out (low, medium, and high). First, to vary the link utilization, we start at
a load of 0.01 flits/cycle/node and maintain this load for 3,000 cycles. At each cycle, a
temperature is selected from a small range of temperatures based on normal bounds for on-chip
temperatures and on the model in [77]. For example, at a load of 0.01 flits/cycle/node,
a temperature would be randomly selected between 75 and 77 Celsius. After 3,000 cycles,
there is a ramp-up period of 500 cycles in which the link utilization increases immediately but the
temperature ramps up slowly. This is to capture the delay in temperature increase after
link utilization has been increased. Table 3.1 shows the change in link utilization and
temperature for the model application. This model application allows our data set to have
both a wide range of input values and a high number of samples.
The output of our fault model, as shown in Figure 3.1, is the probability of error (pe)
for a single bit line in each of the links. Our approach is generic and can be applied to
many different scenarios such as various link widths, number of links, and topology. Each
link has n data bit lines, where n = 64 in this paper, and there are a total of L links in the
network, where L = 48 in a 64 core concentrated mesh topology. Since all bit lines of a
link are relatively close, it is assumed that pe is the same for all n bit lines within a link.
However, each individual link of the L total links can have a different pe.
Figure 3.1 shows two samples in our raw data. For each cycle ti, a new sample is created
which consists of the input temperature (Ti), utilization (Ui), and wear-out (Wi) as well as
the error type (ETi). To determine the error type, pe is used to experimentally determine if
an error occurred in each bit line. Since there are n bit lines in a link, this creates a bit error
vector of size n, where the value of each index is 0 if there is no error and 1 if there is an
error. From this bit vector we can determine the error type (ET) by counting the number of
1’s in the bit vector: No error (0), Few Errors (1-2), or Many Errors (>2). A new sample is
created for every cycle in each run of our model application.
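The error-type determination described above can be sketched in Python. The sketch assumes that each bit line fails independently with probability pe and uses Python's standard pseudo-random generator; the function name `sample_error_type` and the one-letter labels are illustrative choices.

```python
import random


def sample_error_type(pe, n_bits=64, rng=random):
    """Draw a bit error vector of size n_bits and classify it.

    Each bit line is in error with probability pe (assumed independent).
    Returns 'N' (no error), 'F' (1-2 bit errors), or 'M' (more than 2).
    """
    error_vector = [1 if rng.random() < pe else 0 for _ in range(n_bits)]
    ones = sum(error_vector)
    if ones == 0:
        return "N"
    if ones <= 2:
        return "F"
    return "M"


rng = random.Random(0)   # fixed seed so that runs are repeatable
print([sample_error_type(0.01, rng=rng) for _ in range(10)])
```

Calling the function once per cycle for each link yields the ET column of the raw data.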
From the raw data, we develop our data set which consists of features and true labels.
Since we are interested in preventing errors before they occur, we must use currently
available information (features) to predict errors in the future (labels). Therefore, when
creating our data sets, we use inputs from cycle i as our features and outcomes from cycle
i + 1 as our true labels as shown in Figure 3.1. We can use the process described in this
section to obtain a data set for each link so that we can predict errors on each link separately.
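The pairing of features from cycle i with the outcome of cycle i + 1 can be sketched in Python as follows. The tuple layout (temperature, utilization, wear-out, error type) mirrors the raw data columns of Figure 3.1, while the function name `make_dataset` and the sample values are illustrative assumptions.

```python
def make_dataset(raw):
    """Turn raw per-cycle samples into (features, true label) pairs.

    raw: list of (temperature, utilization, wear_out, error_type) tuples,
         one per cycle, in cycle order.
    The features of cycle i are (T_i, U_i, W_i, ET_i), where ET_i serves as
    the Previous Error Type, and the true label is ET_{i+1}.
    """
    return [(raw[i], raw[i + 1][3]) for i in range(len(raw) - 1)]


raw = [
    (75, 0.01, "low", "N"),
    (76, 0.01, "low", "F"),
    (77, 0.10, "low", "N"),
]
for features, label in make_dataset(raw):
    print(features, "->", label)
```

The last raw sample has no successor, so a run of k cycles yields k − 1 labeled samples.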
Table 3.1: Model application link utilization and temperature pattern [1].
Load (flits/cycle/node)   Temperature (Celsius)   Duration (cycles)
0.01                      75-77                   3,000
0.1                       76-80                   500
0.1                       78-82                   3,000
0.2                       80-89                   500
0.2                       87-93                   3,000
0.3                       90-99                   500
0.3                       95-101                  3,000
0.4                       97-101                  500
0.4                       98-104                  3,000
Finally, we randomly split the samples into training, testing, and validation data. The
training data contains 60% of the samples whereas the testing and validation data each
contain 20% of the samples. Since the probability of error is typically very
small, our data set is quite skewed. For example, if the probability of error is 1% then on
average one sample out of every 100 will have an error. The skewed data set will make
the decision tree training more difficult. Therefore, to remedy the skewed data, we use a
common technique in which training samples are randomly replicated such that each label
has an equal number of samples (50,000 samples/label). Now each label will be fairly
represented and machine learning algorithms can be used on the data as explained in the
next section.
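The split and rebalancing steps can be sketched in Python. This is an illustrative sketch under stated assumptions: the function names `split_dataset` and `oversample` are hypothetical, and the rebalancing is implemented as sampling with replacement within each label, which is one common way to realize random replication and may differ from the procedure actually used.

```python
import random
from collections import Counter, defaultdict


def split_dataset(samples, rng):
    """Random 60/20/20 split into training, testing, and validation data."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_test = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


def oversample(train, per_label, rng):
    """Draw per_label samples (with replacement) for every label."""
    by_label = defaultdict(list)
    for features, label in train:
        by_label[label].append((features, label))
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.choice(items) for _ in range(per_label))
    return balanced


rng = random.Random(0)
# Skewed toy data: 95 error-free samples and 5 samples with few errors.
samples = [((i,), "N") for i in range(95)] + [((i,), "F") for i in range(5)]
train, test, validation = split_dataset(samples, rng)
balanced = oversample(samples, per_label=10, rng=rng)
print(len(train), len(test), len(validation), Counter(y for _, y in balanced))
```

In the design described above, `per_label` would be 50,000 and only the training portion would be rebalanced.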
3.2 Machine Learning
The complete list of features we consider in our design are: Temperature (T), Utilization
(U), Wear-out (W), and Previous Error Type (P). The feature values for Temperature are low
(≤84°C), medium (≥85°C and ≤94°C), and high (≥95°C). The feature values for Utilization
are low (≤0.1 flits/cycle/node), medium (>0.1 and <0.3 flits/cycle/node), and high (≥0.3
flits/cycle/node). The feature values for Wear-out are low, medium, and high as described
in Section 3.1.1. The feature values for Previous Error Type are “No Errors,” “Few Errors,”
and “Many Errors.”
Each feature can potentially have an effect on probability of error and we allow the
machine learning algorithm to decide which are the more useful features for predicting
errors. The features can be stored in a small table at each link. The values of some features,
such as Temperature and Wear-out, change slowly over time; therefore, these features do
not need frequent updating. Link temperature can be estimated by periodically gathering
data from coarse-grained temperature sensors on the chip. Since the work in [5] correlates
activity to wear-out, link wear-out can be estimated by keeping a coarse-grained counter
of packets that use each link. Link utilization provides more current information than link
wear-out and can be used in making fine-grained decisions. Finally, Previous Error Type
can easily be updated every time a prediction is made.
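The mapping from measured values to the discrete feature values listed above can be sketched in Python. The function names are illustrative, and one assumption is made explicit: because the temperature classes are stated for whole degrees (≤84, 85 to 94, ≥95), readings between two whole degrees are assigned here to the lower class.

```python
def temperature_level(celsius):
    """Low (<=84 C), medium (85-94 C), or high (>=95 C).

    Fractional readings such as 84.5 fall into the lower class
    (an assumption; the text only lists whole-degree bounds).
    """
    if celsius < 85:
        return "low"
    if celsius < 95:
        return "medium"
    return "high"


def utilization_level(flits_per_cycle_per_node):
    """Low (<=0.1), medium (>0.1 and <0.3), or high (>=0.3)."""
    if flits_per_cycle_per_node <= 0.1:
        return "low"
    if flits_per_cycle_per_node < 0.3:
        return "medium"
    return "high"


def link_features(celsius, utilization, wear_out, previous_error):
    """Assemble the four feature values stored in the per-link table."""
    return {
        "T": temperature_level(celsius),
        "U": utilization_level(utilization),
        "W": wear_out,          # already 'low', 'medium', or 'high'
        "P": previous_error,    # 'No Errors', 'Few Errors', or 'Many Errors'
    }


print(link_features(88, 0.2, "low", "No Errors"))
```

A table entry of this form is what the decision tree of a link would consult when making a prediction.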
In our design, we use decision trees to predict errors on each of the links. Testing for
decision trees simply consists of a small number of comparisons. The maximum number of
comparisons equals the number of levels in the tree. In this paper, we choose the maximum
size of the tree to be three levels as shown in the example tree in Figure 3.2. This allows
our design to quickly predict a new outcome every cycle. Each node in the tree represents
a feature (X) and the arrows represent feature values (Xi). For example, a feature could
be “Temperature” and a feature value could be “low.” The leaves of the tree represent the
[Figure 3.2 (diagram): generic three-level decision tree. Each node is a feature X with branches X1, X2, and X3, and each leaf is an error type ET ∈ {N, F, M}, where N = No Errors, F = Few Errors, and M = Many Errors.]
Figure 3.2: Example of a generic decision tree with outcomes that determine the predicted
error type [1].
predicted outcome (ET) and can be any of the following: No Error (N), Few Errors (F), or
Many Errors (M).
As each link is at a different location in the topology, a separate decision tree is trained
for each link. Training the decision tree is done offline and uses the same ID3 algorithm
as described in Algorithm 3 in Section 2.5. In the ID3 algorithm, the criteria for creating
nodes are information gains. Features with the largest information gain are selected as
nodes in the tree. The process of creating the decision tree is based on the set of training
data, D (previously described in Section 3.1.2), and the set of features, F (Temperature,
Utilization, Wear-out, and Previous Error Type). Recapping Algorithm 3, first, the ID3
algorithm is recursive with the terminating condition being all examples in D have the
same label. Next, the feature with the largest information gain is called X and R is the tree
root labeled with feature X. Then, the initial data set D is partitioned into k separate data
sets (D1,D2, ...,Dk) each corresponding to a value of feature X. Finally, for each data set
Di, the ID3 algorithm is recursively called passing it the partitioned data set Di and a set of
features excluding feature X (F − {X}). The tree root which is returned by the recursive call
(Ri) is added as a new branch to the original tree root R. After the ID3 algorithm is finished,
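[Figure 3.3 (diagram): decision tree with root X = Temp. and branches X1, X2, and X3; the lower nodes are Util., Wear-Out, and Prev. Error, and the leaves are N, F, or M.]
Figure 3.3: Example decision tree built using the ID3 algorithm.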
a complete tree is built with nodes containing features that maximize the information gain
on the training set. We modify the terminating condition of the ID3 algorithm to limit the
size of the tree to three levels so that the number of comparisons during testing is at most
three, resulting in low energy and latency overheads [3].
Now we will detail the steps in building a decision tree in our design using an example
training data set. Figure 3.3 shows the fully built decision tree and Figure 3.4 shows the
steps in building this decision tree. The table titled D in Figure 3.4 is the example data set.
There are nine samples in this data set and each sample has feature values. Some feature
values are replaced by a dash (-) for the simplicity of the example. Also, each sample has a
true label for the error type (ET).
Following the steps in Algorithm 3, not all samples in D have the same label, so the
terminating condition is not met and we can move on. Next, for the simplicity of this
example, we assume Temperature has the largest information gain; therefore, we let
X=Temperature in Step 1 of Figure
3.4. Temperature is now the root of the decision tree. Next, in Step 2a, D is partitioned
by splitting it on the feature values of Temperature (Low, Medium, and High). After the
Data set D from Figure 3.4 (T = Temperature, U = Utilization, W = Wear-out, P = Previous
Error Type, ET = Error Type):
Sample #  T     U     W     P  ET
1         Low   -     -     -  N
2         Med   -     -     N  N
3         Med   -     -     F  F
4         Med   -     -     M  M
5         High  -     Low   -  N
6         High  -     Med   -  N
7         High  Low   High  -  N
8         High  Med   High  -  N
9         High  High  High  -  F
Figure 3.4: Steps to build the example decision tree. Step 1: Feature with the highest
information gain becomes the root of the tree. Step 2a: Data set is partitioned on the
feature values of the root. Step 2b: Root feature is removed from the feature list. Step 3:
Recursively call ID3 on each newly partitioned data set. Repeat all steps until the tree is
completed [1].
splitting, there are now three tables showing the data sets D1, D2, and D3. Also, since
Temperature was already used, we can remove it from the feature list in Step 2b.
Finally, Step 3 is to recursively call the ID3 algorithm for each newly partitioned data
set. For example, the ID3 algorithm will be used on data set D1 to find the node at the first
branch of the tree. Since all of the samples in this partitioned data set have the label N, the
label is returned and becomes a node in the tree. The ID3 algorithm will also be called on
D2 to find the node at the second branch in the tree. Since not all of the samples have
the same label and we assume Previous Error has the largest information gain, Previous
Error becomes the next node in the tree. Steps 2a-3 are repeated: D2 is split, the feature set
is reduced further, and ID3 is called again which results in the labels N, F, and M. Finally,
ID3 is called on D3 which is the third branch corresponding to a high temperature. The
steps are repeated until all samples have the same label or there are three levels in the tree.
After all nodes in the tree have features or labels, the result is the completed tree in Figure
3.3.
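The recursive procedure walked through above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `entropy`, `information_gain`, `id3`, and `predict` are hypothetical, omitted feature values (dashes) are simply treated as a value of their own, and a majority label is returned when the three-level limit is reached. Because the example in the text only assumes that Temperature has the largest information gain, the root computed from the raw data need not be Temperature.

```python
import math
from collections import Counter

FEATURES = ["T", "U", "W", "P"]  # Temperature, Utilization, Wear-out, Previous Error Type

# Example data set D from Figure 3.4; "-" (omitted) is kept as its own value (assumption).
D = [
    {"T": "Low",  "U": "-",    "W": "-",    "P": "-", "ET": "N"},
    {"T": "Med",  "U": "-",    "W": "-",    "P": "N", "ET": "N"},
    {"T": "Med",  "U": "-",    "W": "-",    "P": "F", "ET": "F"},
    {"T": "Med",  "U": "-",    "W": "-",    "P": "M", "ET": "M"},
    {"T": "High", "U": "-",    "W": "Low",  "P": "-", "ET": "N"},
    {"T": "High", "U": "-",    "W": "Med",  "P": "-", "ET": "N"},
    {"T": "High", "U": "Low",  "W": "High", "P": "-", "ET": "N"},
    {"T": "High", "U": "Med",  "W": "High", "P": "-", "ET": "N"},
    {"T": "High", "U": "High", "W": "High", "P": "-", "ET": "F"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(samples, feature):
    """Entropy reduction obtained by partitioning the samples on one feature."""
    remainder = 0.0
    for value in sorted({s[feature] for s in samples}):
        subset = [s["ET"] for s in samples if s[feature] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy([s["ET"] for s in samples]) - remainder


def id3(samples, features, max_depth=3):
    """Return a label (leaf) or a node dict; stops after max_depth feature tests."""
    labels = [s["ET"] for s in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority
    best = max(features, key=lambda f: information_gain(samples, f))
    remaining = [f for f in features if f != best]  # F - {X}
    node = {"feature": best, "default": majority, "branches": {}}
    for value in sorted({s[best] for s in samples}):
        subset = [s for s in samples if s[best] == value]  # partition Di
        node["branches"][value] = id3(subset, remaining, max_depth - 1)
    return node


def predict(tree, sample):
    """Follow at most max_depth comparisons from the root down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["feature"]], tree["default"])
    return tree


tree = id3(D, FEATURES)
```

Calling `predict(tree, sample)` then costs at most three feature comparisons, which is the property the three-level limit is meant to guarantee.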
3.3 Error Mitigation and Router Microarchitecture
Once the errors can be predicted, the next step is mitigation of the errors. We add
cyclic redundancy check (CRC) encoder blocks at the router ports coming from the cores
and CRC decoders at the router ports going to the cores. The top of Figure 3.5 shows an
example of a packet transmission using CRC. When a packet is injected into the network,
it is always encoded with CRC. Before the packet is ejected out of the network it is always
decoded and checked for errors. If errors are detected at a packet’s destination then the
packet must be retransmitted from the source.
We use the IEEE standard CRC-32 [82]. CRC-32 is a commonly used error detection
code capable of detecting all one to three bit errors, odd bit errors and a fraction of
burst errors up to the size of the 32-bit check code. Each 32-bit check code is produced
Figure 3.5: Examples of CRC, SECDED, and relaxed transmission mitigation techniques [1].
and appended to the tail of each packet as it is injected into the network, extending the
packet size by 32 bits to a total of 256 bits. When an error is detected anywhere in the
packet, we know the error exists but correction of the error is not possible with CRC
decoding. Mitigation of the error is concluded with successful packet retransmission. CRC
offers the most coverage in our network but has the highest overhead. Hence, CRC is
reserved only for use at the source and destination routers. Since the source can often be
several hops away from the destination, we will reduce the latency and energy overheads
of retransmission by error mitigation at the intermediate hops between the source and
destination.
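The end-to-end CRC check can be illustrated with a short software sketch. It assumes a 224-bit (28-byte) payload so that the appended 32-bit check code yields the 256-bit packet described above, and it uses Python's `zlib.crc32`, which implements the IEEE 802.3 CRC-32 polynomial. The function names `crc_encode` and `crc_check` are hypothetical and the byte order of the check code is an arbitrary choice; the hardware encoder and decoder in the router are not modeled.

```python
import zlib


def crc_encode(payload: bytes) -> bytes:
    """Append the 32-bit CRC check code to the tail of the packet."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_check(packet: bytes) -> bool:
    """Recompute the CRC at the destination; False means a retransmission is needed."""
    payload, check = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == check


payload = bytes(range(28))      # 224-bit payload (assumed size)
packet = crc_encode(payload)    # 256-bit packet including the check code

corrupted = bytearray(packet)
corrupted[5] ^= 0x04            # flip a single bit in transit
retransmit_needed = not crc_check(bytes(corrupted))
```

The check only reports that an error exists; as stated above, recovery still requires retransmitting the packet from the source.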
At the intermediate hops, we use two forms of error mitigation: (a) Hamming codes,
which provide single error correction and double error detection (SECDED), and (b) relaxed
transmission (RT). An example of packet transmission using SECDED is shown in the
middle of Figure 3.5. We encode at the output port of each router with SECDED, then
the flit traverses the link which is susceptible to faults, and finally we decode at the input
port of the next router. If errors are detected and cannot be corrected then a retransmission
must occur. However, only one flit must be retransmitted and the flit is only retransmitted
one hop as opposed to being retransmitted from the source. If an error can be corrected
then a retransmission can be avoided. For error correction, SECDED is implemented using
a (72,64) Hamming code with a maximum error correction tolerance of 1 bit per flit. All
single bit errors in the 72-bit encoded flit will be corrected upon injection into the router.
Overall, SECDED always has an energy overhead but has a latency overhead only if a
retransmission is required.
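To make the correct-or-retransmit behavior concrete, the following sketch implements a textbook extended Hamming (72,64) code in software. The bit layout (overall parity at position 0, Hamming parity bits at the power-of-two positions, data bits elsewhere) and the names `secded_encode` and `secded_decode` are assumptions chosen for illustration; the hardware encoder and decoder used in the router may arrange the bits differently.

```python
# Positions 1..71 that are not powers of two carry the 64 data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]


def _syndrome(code):
    """XOR of the positions (1..71) of all set bits; zero for a valid codeword."""
    syndrome = 0
    for position in range(1, 72):
        if code[position]:
            syndrome ^= position
    return syndrome


def secded_encode(data):
    """Encode a 64-bit integer into a list of 72 bits."""
    code = [0] * 72
    for i, position in enumerate(DATA_POSITIONS):
        code[position] = (data >> i) & 1
    syndrome = _syndrome(code)
    for i in range(7):                      # parity bits at positions 1, 2, 4, ..., 64
        code[1 << i] = (syndrome >> i) & 1
    code[0] = sum(code) & 1                 # overall parity makes the total parity even
    return code


def secded_decode(code):
    """Return (data, status); data is None when a one-hop retransmission is needed."""
    code = list(code)
    syndrome = _syndrome(code)
    odd_parity = sum(code) & 1
    if odd_parity:                          # odd number of flipped bits: assume one
        if syndrome >= 72:
            return None, "detected"
        code[syndrome] ^= 1                 # syndrome 0 means the overall parity bit flipped
        status = "corrected"
    elif syndrome:                          # even parity but nonzero syndrome: two errors
        return None, "detected"
    else:
        status = "clean"
    data = sum(code[position] << i for i, position in enumerate(DATA_POSITIONS))
    return data, status


flit = 0x0123456789ABCDEF
encoded = secded_encode(flit)
```

A "corrected" result corresponds to the case where retransmission is avoided, while a "detected" result corresponds to retransmitting one flit over one hop.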
The second form of error mitigation is RT and is shown at the bottom of Figure 3.5.
RT simply waits an additional cycle before reading the data at the input port; essentially,
giving the flit twice as much time to traverse the link. This will relax the timing constraint
on the transmission of the flit and reduce the probability of a timing error to very close
to zero. RT can be done by, first, stalling the flit and sending a signal to the input
demultiplexer of the downstream router. The downstream router is now notified to wait
two cycles before reading data. After the router is notified, the flit begins the two cycle
link traversal. Compared to a regular flit transmission, RT has no extra energy overhead
but has an extra latency overhead of two cycles: one cycle to signal the downstream router
and one additional cycle for data transmission. Overall, we can offer error mitigation on
a hop-by-hop basis using SECDED or RT and we can offer greater error mitigation from
source to destination using CRC.
Each error mitigation technique may have latency and energy overheads depending
on different outcomes. Table 3.2 summarizes the latency overheads of each mitigation
technique based on how many errors are in the data, i.e., the true outcome. For CRC,
if there are any errors, then the full packet will be retransmitted from the source.
For SECDED, if there are no errors, then there will be no additional latency overhead.
However, if there are a few errors, then there will be two possibilities: (1) a single error
will be corrected and there will be no latency penalty, or (2) two errors will be detected in
which only one flit will be retransmitted for one hop. If there are many errors, then it will
only be detected by the CRC at the destination and a full retransmission will be required.
For RT, there will be a two cycle delay for any number of timing errors.
Table 3.3 summarizes the energy overheads of each mitigation technique. For CRC,
there will always be the energy to encode/decode as well as the possibility for additional
energy due to a full retransmit. When SECDED is employed, there will be the SECDED
encoder/decoder energy. Additionally, there can be an energy overhead from either a full
retransmission or a one hop, one flit retransmission. The energy dissipation for a one hop
retransmission will be much lower than the energy dissipation for a full retransmission.
Lastly, RT has no additional energy overheads for timing errors.
As each form of error mitigation has either a latency or energy overhead, we will
reduce these additional overheads by using prediction to dynamically enable and disable the
SECDED and RT. Figure 3.6 shows the router microarchitecture in our network including
the predictor blocks. There are also virtual channels (VCs) to buffer flits at the input ports
and avoid deadlocks. We use a crossbar to switch the flits from the input ports to the output
ports as well as output buffers to store packets at the output ports when needed. The router
is a typical credit-based, pipelined router with route computation (RC), virtual channel
allocation (VA), and switch allocation (SA) blocks. Routers can be connected to other
routers via links in any topology. The CRC blocks are shown at the cores as we employ
Table 3.2: Latency overheads of each mitigation technique given the number of timing
errors [1].
True Outcome   CRC             SECDED                          RT
No Errors      No Overhead     No Overhead                     2 Cycle Delay
Few Errors     Full Retrans.   No Overhead or 1 Hop Retrans.   2 Cycle Delay
Many Errors    Full Retrans.   Full Retrans.                   2 Cycle Delay
Table 3.3: Energy overheads of each mitigation technique given the number of timing errors
[1].
True Outcome   CRC                   SECDED                         RT
No Errors      CRC                   SECDED                         No Overhead
Few Errors     CRC + Full Retrans.   SECDED and/or 1 Hop Retrans.   No Overhead
Many Errors    CRC + Full Retrans.   SECDED + Full Retrans.         No Overhead
this error mitigation only at the source and destination. The SECDED encoders are at the
output ports and the SECDED decoders are at the input ports as this error mitigation is used
on a hop-by-hop basis.
Figure 3.6: Router microarchitecture showing the feature table, predictors, encoders,
decoders, and CRC blocks [1].
Based on the output of the prediction algorithm, the network can choose when to enable
SECDED, when to enable RT, or when to disable both. The features previously explained
are stored in the features table shown in Figure 3.6. For each output port, the features table
has an entry which stores the feature values for temperature (T), utilization (U), wear-out
(W), and previous error type (P). Additionally, at each output port there is a predictor block
(predictor0 to predictorm) which implements each link’s unique decision tree using simple
comparisons. The predictor takes a table entry as input and then outputs the predicted error
type as well as an enable signal. Based on the predicted error type (ET) three different
actions can occur: 1) If ET=“No Errors” then no action is taken, 2) If ET=“Few Errors”
then the flit is encoded with SECDED before the link is used, and 3) If ET=“Many Errors”
then the flit is sent using RT. Therefore, if SECDED is used then the predictor enables the
encoder and forwards the predicted ET to the downstream router so that it knows to decode
the flit when it arrives. Similarly with RT, the predictor forwards the predicted ET to the
downstream router so that the input port knows to wait an additional cycle before reading
the data.
Figure 3.7 shows an example of packet transmission with and without prediction. With
both prediction and no prediction, we assume CRC is always enabled in order to detect
errors at the destination. Without prediction, SECDED and RT cannot be dynamically
enabled and disabled in an intelligent manner. Therefore, in this particular example we
assume SECDED is always enabled and RT is always disabled. With no prediction, energy
is dissipated at every hop due to the SECDED encoders/decoders. If no errors occurred,
then this energy dissipation was unnecessary. With prediction, when our design correctly
predicts that no errors will occur, then the SECDED encoders/decoders can be disabled
saving power. When a few errors are correctly predicted, the SECDED encoders/decoders
are enabled and a few bit errors can be mitigated in the same manner as if SECDED is
always enabled. Lastly, with no prediction, if many errors occur on the link, they will not be
detected until the packet reaches the destination. However, with prediction, if many errors
are correctly predicted then RT can be employed, possibly avoiding a full retransmission.
However, with any prediction technique there is also the possibility of misprediction.
If our design incorrectly predicts that no errors will occur then this misprediction will lead to
a full retransmission. If our predictor incorrectly predicts that a few errors will occur then
the SECDED will unnecessarily be activated, wasting energy. Lastly, a misprediction of
many errors will lead to an unnecessary two cycle delay. Therefore, it is important that
our predictor performs well so that the advantages of correctly predicting outweigh the
misprediction penalties as evaluated in the next section.
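The interaction between the predicted error type, the enabled mitigation technique, and the resulting latency outcome can be summarized in a small sketch. The action mapping follows the three cases listed above and the latency entries follow Table 3.2, with "none" standing for the case where only the always-on CRC is active. The comparison tree in `predict_error_type` is a hypothetical depth-three tree shaped like the worked example (Temperature, then Wear-out or Previous Error, then Utilization); all names are illustrative and real per-link trees would differ.

```python
# Mitigation technique enabled for each predicted error type (N, F, M).
ACTION = {"N": "none", "F": "SECDED", "M": "RT"}

# Latency outcome for each enabled technique given the true outcome (Table 3.2).
LATENCY = {
    "none":   {"N": "no overhead", "F": "full retransmission", "M": "full retransmission"},
    "SECDED": {"N": "no overhead", "F": "no overhead or 1-hop retransmission",
               "M": "full retransmission"},
    "RT":     {"N": "2-cycle delay", "F": "2-cycle delay", "M": "2-cycle delay"},
}


def predict_error_type(entry):
    """Hypothetical per-link tree evaluated with at most three simple comparisons."""
    if entry["T"] == "Low":
        return "N"
    if entry["T"] == "Med":
        return entry["P"]            # previous error type decides: N, F, or M
    if entry["W"] != "High":         # high temperature, low or medium wear-out
        return "N"
    return "F" if entry["U"] == "High" else "N"


def transmit(entry, true_outcome):
    """Return (predicted ET, enabled technique, latency outcome) for one flit."""
    predicted = predict_error_type(entry)
    action = ACTION[predicted]
    return predicted, action, LATENCY[action][true_outcome]


# Correct prediction of a few errors: SECDED is enabled for this hop.
hit = transmit({"T": "High", "U": "High", "W": "High", "P": "N"}, "F")
# Misprediction of no errors when many occur: only CRC catches them at the destination.
miss = transmit({"T": "Low", "U": "Low", "W": "Low", "P": "N"}, "M")
```

Reading the `miss` case against the `hit` case shows why the misprediction penalties discussed above depend on which technique the predictor left disabled.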
Figure 3.7: Example of packet transmission with and without prediction [1].
4 Evaluation
In this chapter we will separately evaluate the QORE architecture which overcomes
hard faults in Section 4.1 and the fault prediction architecture which overcomes soft faults
in Section 4.2.
4.1 QORE Results
In this section, we first consider the overhead for our reconfiguration controllers and
reversible buffers [2, 3]. Next, we evaluate the fault tolerant performance of QORE
compared to the Ariadne [6] and Vicis [7] networks by evaluating throughput and power on
synthetic traffic as well as speedup on real benchmarks. Next, we consider the effect of our
reversibility on the overall performance of QORE when no faults are present by comparing
to BAR [47], which is not a fault tolerant network [2, 3]. Lastly, we evaluate the accuracy
of our decision trees in predicting traffic.
For open-loop measurement, we varied the network load from 0.1-0.9 of the network
capacity. The simulator was warmed up under load without taking measurements until
steady state was reached. Then a sample of injected packets was labeled during a
measurement interval. The simulation was allowed to run until all the labeled packets
reached their destinations. All designs were tested with different synthetic traffic traces
such as Uniform Random (UR), Non-Uniform Random (NUR), Bit-Reversal (BR), Butterfly
(BFLY), Matrix Transpose (MT), Complement (COMP), and Perfect Shuffle (PS) [2, 3].
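For readers unfamiliar with the bit-permutation patterns named above, the sketch below gives their commonly used textbook definitions as functions from a source address to a destination address. It is an illustration under assumptions: the 6-bit address width corresponds to 64 nodes, the function names are hypothetical, and the exact definitions used by the simulator are not stated in the text.

```python
BITS = 6  # 64 nodes give 6-bit addresses (assumed address width)


def bit_reverse(src, bits=BITS):
    """BR: destination is the source address with its bits in reverse order."""
    return int(format(src, f"0{bits}b")[::-1], 2)


def complement(src, bits=BITS):
    """COMP: every address bit is inverted."""
    return ~src & ((1 << bits) - 1)


def transpose(src, bits=BITS):
    """MT: the upper and lower halves of the address are swapped."""
    half = bits // 2
    mask = (1 << half) - 1
    return ((src & mask) << half) | (src >> half)


def shuffle(src, bits=BITS):
    """PS: the address is rotated left by one bit."""
    msb = src >> (bits - 1)
    return ((src << 1) & ((1 << bits) - 1)) | msb


def butterfly(src, bits=BITS):
    """BFLY: the most and least significant bits are swapped."""
    msb = (src >> (bits - 1)) & 1
    lsb = src & 1
    if msb != lsb:
        src ^= (1 << (bits - 1)) | 1
    return src
```

Under these definitions each source always targets the same destination, which is what makes such patterns stress specific links more than uniform random traffic does.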
For closed-loop measurement, the full execution-driven simulator SIMICS from Wind
River [83] with the memory package GEMS [84] was used to extract traffic traces from real
applications. The Splash-2, PARSEC, and SPEC CPU2006 workloads were used to evaluate
the performance of 64-core networks. Table 4.1 shows the parameters for the cache and
core used for the Splash-2, PARSEC, and SPEC2006 benchmarks. We assume a 2 cycle
delay to access the L1 cache, a 4 cycle delay for the L2 cache, and a 160 cycle delay
Table 4.1: Cache and core parameters used for Splash-2, PARSEC, and SPEC2006
application suite simulation [2] ©2014 IEEE [3] ©2015 IEEE.
Parameter                   Value
L1/L2 coherence             MOESI
L2 cache size/assoc         4MB/16-way
L2 cache line size          64
L2 access latency (cycles)  4
L1 cache/assoc              64KB/4-way
L1 cache line size          64
L1 access latency (cycles)  2
Core Frequency (GHz)        5
Threads (core)              2
Issue policy                In-order
Memory Size (GB)            4
Memory Controllers          16
Memory Latency (cycle)      160
Directory latency (cycle)   80
to access main memory. The power and area results were estimated using the Synopsys
Design Compiler with the 40 nm TSMC technology library [2, 3].
For fair comparison, every network had 4 VCs per input and each network was assumed
to have a concentration of four cores to a single router as this has been shown to minimize
energy and latency while allowing a larger number of cores on a chip [85]. Additionally,
we maintained similar bisection bandwidths for each network. The conventional router
design (both Ariadne and Vicis) has two links between each pair of routers (one for each
direction) and QORE has at most six links between routers (four reversible links, two
unidirectional links for the ring) for a ratio of 1:3. However, the bandwidth of each link
Table 4.2: Power overhead for the components of one router [2] ©2014 IEEE [3] ©2015
IEEE.
Component   Baseline               QORE                   Percent Diff.
Storage     111.6 mW               19.8 mW                -82.3%
LC          0                      96.27 nW               -
FC          0                      96.64 nW               -
Predictor   0                      3.99 µW                -
Link        307.2 mW (2×96 bits)   307.2 mW (6×32 bits)   0%
Crossbar    67.4 mW (8×8)          86.2 mW (9×9)          +27.9%
Total       486.2 mW               413.2 mW               -15.0%
in QORE is 32 bits/cycle so the total bandwidth between routers will be 192 bits/cycle.
Therefore, each link in the conventional design will be 192/2=96 bits/cycle which is 3×
the bandwidth of a QORE link. We have assumed that the backup ring network is fault-free
and that each packet consists of four 128-bit flits [2, 3].
4.1.1 Power, Area, and Timing Overhead
Table 4.2 shows the power overhead for the network components of one router
estimated from the Synopsys Design Compiler with a nominal supply voltage of 1.0 V and
an operating frequency of 2 GHz. A buffer for the baseline design is a four flit register
buffer and a buffer for QORE is a four stage reversible channel buffer. Each router, in either
design, contains 32 buffers (4 inputs × 8 buffers). The buffers in QORE consume 19.8 mW
of power; approximately 82.3% less than the baseline register buffers. The amount of
leakage power for the reversible channel buffer was found to be 2.44 nW. The overhead
of the LC and FC is approximately 96 nW of power and a timing of 0.07 ns. The power
Table 4.3: Area overhead for the components of one router [2] ©2014 IEEE [3] ©2015
IEEE.
Component   Baseline (µm2)        QORE (µm2)            Percent Diff.
Storage     43,712                147,392               +237.2%
LC          0                     1.41                  -
FC          0                     1.42                  -
Predictor   0                     29.81                 -
Link        23,629 (2×96 bits)    23,629 (6×32 bits)    0%
Crossbar    580,007 (8×8)         622,418 (9×9)         +7.3%
Total       647,348               793,471               +22.6%
overhead for the predictor was found to be 3.99 µW with a timing of 0.17 ns. The additional
power overhead of the controllers in QORE is a minimal fraction of the total router and the
timing is within our clock period. The link power for both baseline and QORE are equal
since the total link bandwidth is kept equal. A crossbar power overhead of 27.9% is due
to the backup ring network in QORE leading to a slightly larger crossbar [2, 3].
Table 4.3 shows the area overhead of each router component. The buffers in QORE
occupy 147,392 µm2 which is 3.4× more area than the baseline register buffers. However,
unlike register buffers and conventional channel buffers, our channel buffers serve three
functions: storage, reversibility, and a link repeater. The area overheads of the LC and FC
components are approximately 1.4 µm2 each and the area overhead of the predictor is 29.81 µm2,
which is minimal compared to the other router components. The timing for our reversible
channel buffers was estimated to be 0.39 ns which is within our specified clock period of
0.50 ns. The critical path of the four stage reversible channel buffers was composed of
eight pass gates (0.22 ns) and four non-reversible channel buffers (0.17 ns). The timing of
the critical path as well as the estimates of power and area accounted for all additional wiring
required between routers [2, 3].
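The totals and percentage differences reported in Tables 4.2 and 4.3, and the timing margin of the reversible channel buffer, can be re-derived from the individual entries. The short check below is only a worked-arithmetic sketch; the helper `percent_diff` and the variable names are hypothetical, and the values are copied from the tables and text above.

```python
def percent_diff(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Power per router in mW as (baseline, QORE), from Table 4.2.
power_mw = {"Storage": (111.6, 19.8), "Link": (307.2, 307.2), "Crossbar": (67.4, 86.2)}
# LC and FC (nW) and predictor (µW) converted to mW; present only in QORE.
qore_controllers_mw = 96.27e-6 + 96.64e-6 + 3.99e-3

baseline_power = sum(b for b, _ in power_mw.values())
qore_power = sum(q for _, q in power_mw.values()) + qore_controllers_mw

# Area per router in µm2 as (baseline, QORE), from Table 4.3.
area_um2 = {"Storage": (43712, 147392), "Link": (23629, 23629), "Crossbar": (580007, 622418)}
qore_controllers_um2 = 1.41 + 1.42 + 29.81

baseline_area = sum(b for b, _ in area_um2.values())
qore_area = sum(q for _, q in area_um2.values()) + qore_controllers_um2

# Critical path of the four stage reversible channel buffer versus the 2 GHz clock.
clock_period_ns = 1.0 / 2.0
critical_path_ns = 0.22 + 0.17
```

The controller and predictor contributions are several orders of magnitude below the storage, link, and crossbar terms, which is why they do not change the rounded totals.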
4.1.2 Speedup on Real Applications
The speedup of BAR (B), QORE (Q), and Ariadne (A) relative to Vicis (V) for different real
applications is shown in Figure 4.1 and Figure 4.2. The networks were simulated on all
applications; however, we only show four applications in the figures. QORE reconfigures
its links every Rw = 50 cycles. Different values of Rw are evaluated in Section 4.1.6.
Before runtime, faults were randomly inserted into a percentage of links ranging from 0%
to 50%. Since BAR is not a fault tolerant network, it is only shown for 0% faults. At
0% faults, the performance optimized BAR has the largest speedup for all applications as
expected. At low to medium fault percentages (0-30%), QORE has an average speedup of
1.68× across all benchmark applications. At high fault percentages (40-50%), QORE has a
lower speedup of 0.51× on average. However, this can be misleading because the high number
of faults causes Ariadne and Vicis to be split into small subnetworks. Subnetworks are
very undesirable because cores in one subnetwork cannot communicate with
cores in another subnetwork. The average number of subnetworks for each network
is shown in Figure 4.3. QORE always maintains connectivity through the backup ring
network, whereas the number of subnetworks in Ariadne and Vicis increases with the number
of faults. Subnetworks partition the chip, blocking communication to many cores. The
subnetworks, therefore, lead to a false increase in speedup, as also observed in [6]. In
contrast, the reversibility of links makes QORE more resilient to communication blocking [2, 3].
4.1.3 Network Throughput
The saturation throughput of the networks for different synthetic traffic mixes is shown
in Figure 4.4. The four traffic mixes we examine are shown in Table 4.4 using
Figure 4.1: Speedup of BAR (B), QORE (Q), and Ariadne (A) relative to Vicis (V) with a
varying number of faults (0-50%) for 64 cores on (a) FFM and FFT apps and (b) bzip
and freqmine apps [2] ©2014 IEEE [3] ©2015 IEEE. Bars marked with * indicate a high number
of subnetworks.
the abbreviations defined previously in Section 4.1. Each mix randomly cycles through
each pattern every TP=250 cycles. QORE reconfigures its links every Rw = 50 cycles
[2, 3].
Figure 4.2: Speedup of BAR (B), QORE (Q), and Ariadne (A) relative to Vicis (V) with a
varying number of faults (0-50%) for 64 cores on (a) LU and Ocean apps and (b)
streamcluster and swaptions apps [2] ©2014 IEEE [3] ©2015 IEEE. Bars marked with * indicate
a high number of subnetworks.
In Figure 4.4, QORE consistently has similar throughput to BAR and a higher
throughput than both Ariadne and Vicis. Averaged over each traffic mix and fault
percentage, QORE’s saturation throughput is 2.3× and 2.9× higher than Ariadne and Vicis,
Figure 4.3: Average number of subnetworks of QORE compared to Ariadne and Vicis [2]
©2014 IEEE. Axes: number of subnetworks versus percent faults (10-50%).
Table 4.4: Breakdown of traffic mixes [2] ©2014 IEEE [3] ©2015 IEEE.
Mix     Patterns
Mix 1   BR, BFLY, COMP
Mix 2   NUR, BR, PS
Mix 3   UN, BFLY, MT
Mix 4   UN, BR, COMP, PS
respectively. Similar to the speedup results, an increase in throughput can be seen in all
mixes when the fault percentage changes from 20% to 30%. Again, this is due to link
faults causing the network to be partitioned into smaller subnetworks. When the faults
increase to a high percentage (40-50%), few flits are sent on a network that has many cores,
so the throughput (flits/cycle/core) starts to decrease again [2, 3].
From 0-20% faults, QORE only sees a drop in performance of 3.5% averaged over
all mixes compared to an approximately 70% drop for Ariadne and Vicis. QORE is able
Figure 4.4: Saturation throughput for varying percentage of link failures for different traffic
mixes for BAR (B), QORE (Q), Ariadne (A) and Vicis (V) [2] ©2014 IEEE [3] ©2015
IEEE. Axes: saturation throughput (flits/cycle/core) versus link fault percentage (0-50%).
to sustain performance due to the adaptability of its links. When a wire between two
routers is faulty in Ariadne or Vicis, then all communication between those two routers
is blocked even if other wires are non-faulty. With many faults, this limits the number of
paths in the network. Therefore, many packets are sharing the same paths which causes a
drastic increase in contention for links. QORE, on the other hand, can overcome one or
more faulty wires by reversing the available non-faulty links. Reversibility preserves paths
between routers which relieves contention. Maintaining minimal contention for links is a
main factor for maintaining high throughput [2, 3].
4.1.4 Packet Latency
Figure 4.5 and Figure 4.6 show the packet latency at various fault
percentages for traffic mix 1. At 0% faults, BAR saturates at the highest load due to its
adaptability and fine-grained flit transmission. The low-load latency for both BAR at
0% faults and QORE for all faults is higher than both Ariadne and Vicis. This is due to
the serialization delays combined with narrow links in BAR and QORE. However, QORE
saturates at a higher load for most fault percentages. At 10%, 20%, and 30% faults, QORE
saturates at a load at least 77%, 160%, and 150% higher than Ariadne and Vicis, respectively.
Faults in Ariadne and Vicis can easily shut down communication between routers. The fault
tolerant schemes in these networks force many packets to take additional hops to reach their
destinations because they must move around routers. The increase in hop count greatly
increases packet latency for the Ariadne and Vicis networks. QORE is able to route more
packets minimally
to their destination to keep latency low. At 50% faults, Ariadne and Vicis saturate 87.5%
higher than QORE. However, this is due to the many unreachable cores in Ariadne and
Vicis which create very small subnetworks resulting in packets with little to no contention
[2, 3].
4.1.5 Network Power
The total network power for the networks is shown in Figure 4.7 for different numbers
of link faults and all four mixes. For mix 1, QORE saves at least 15% power over Ariadne
and Vicis on average. Additionally, QORE saves 25%, 22%, and 23% on traffic mixes 2,
Figure 4.5: Latency plots for traffic mix 1 for 0-20% faults [2] ©2014 IEEE [3] ©2015
IEEE. Panels: (a) 0%, (b) 10%, (c) 20%. Axes: latency (cycles) versus offered load.
Figure 4.6: Latency plots for traffic mix 1 for 30-50% faults [2] ©2014 IEEE [3] ©2015
IEEE. Panels: (a) 30%, (b) 40%, (c) 50%. Axes: latency (cycles) versus offered load.
3, and 4, respectively. The main contribution to the power savings is the link power. Ariadne
and Vicis route around faulty links, which often leads to packets taking non-minimal paths to
the destination. QORE can avoid this as long as there is one working link between routers.
The only time QORE may take a non-minimal path is when the backup
ring is used which, in this simulation, only occurred when the fault percentage was 50%. As
seen in Figure 4.7 for 50% faults, the power of QORE is higher than Ariadne because the
backup ring routes packets on non-minimal paths for this particular traffic mix. BAR has a
power 9.5% less than QORE due to the backup ring in QORE which increases the crossbar
size by one. However, when the number of faults increases, QORE cannot be compared to
BAR since BAR is not a fault tolerant network. Therefore, QORE can save approximately
21% power on average while providing better fault coverage with a speedup 1.3× higher
and throughput improved by 2.3× [2, 3].
To examine the effect of channel buffers on QORE, Figure 4.8 shows only the buffer
power of a flit. Only traffic mix 1 is shown; the other mixes showed very similar results. QORE
reduces buffer power by 53.2% on average. This savings is due to the low-power channel
buffers. The only time register buffers are used in QORE is when a packet is being sent
from a core, when a packet is on the backup ring, or when escape buffers are used. The
increase in buffer power at 50% faults for QORE is due to the increased use of the backup
ring [3].
4.1.6 Network Performance of Reversibility
We have shown that QORE can handle errors very well using reversibility. In this
subsection, we will show that QORE can overcome faults while maintaining performance
by comparing QORE to the non-fault tolerant, high performance BAR. In BAR, links are
reconfigured every cycle (Rw = 1 cycle). In this subsection, we evaluate the effect of a
longer Rw on QORE as well as the difference between the two networks [2].
Figure 4.7: Network power for different traffic mixes for BAR (B), QORE (Q), Ariadne
(A) and Vicis (V) [2] ©2014 IEEE [3] ©2015 IEEE. Panels: (a) Mix 1 and Mix 2, (b) Mix 3
and Mix 4. Axes: power per flit (mW) versus link fault percentage (0-50%).
Figure 4.8: Buffer power for traffic mix 1 [3] ©2015 IEEE. Axes: buffer power per flit (mW)
versus link faults (%) for BAR, QORE, Ariadne, and Vicis.
Figure 4.9 shows the saturation throughput of QORE compared to BAR and a baseline
network which is QORE without reversibility. In the first two parts of Figure 4.9, TP=250
is the same and Rw varies from 50 cycles in Figure 4.9(a) to 100 cycles in Figure 4.9(b).
When Rw = 50 the saturation throughput of QORE is 2.5% less than BAR. When Rw
increases to 100 cycles, QORE has fewer opportunities to reconfigure and the performance
drop increases from 2.5% to 4.6%. In Figure 4.9(c) and 4.9(d) TP increases to 500 cycles
and Rw changes from 50 cycles to 100 cycles again. In this case, the performance drop
changes from 1.7% when Rw = 50 to 7.3% when Rw = 100. The uniform nature of mixes 3
and 4 gives BAR a slight advantage over QORE since BAR reconfigures every cycle and at a
finer granularity. Compared to the baseline, QORE can improve throughput by an average
of 7.9% when Rw = 100 and the improvement can increase to 12.2% when Rw is changed
to 50 cycles. Overall, when Rw = 50 cycles QORE performance is only 2.5-4.6% lower
than BAR and 12.2% higher than the baseline, but has the additional benefit of being able
to handle faults [2].
Figure 4.9: Effect of Rw on saturation throughput (flits/cycle/core) of BAR, QORE, and the
baseline for traffic mixes 1 to 4: (a) TP=250, Rw=50; (b) TP=250, Rw=100; (c) TP=500,
Rw=50; (d) TP=500, Rw=100 [2] ©2014 IEEE.
4.1.7 Sensitivity Study: Varying Number of Links
In this subsection, we evaluate the effect of varying the number of links between routers,
N. We vary N from 2 reversible links between routers (Q2) to 4 links (Q4) to 8 links
(Q8). As we vary N, the bandwidth of each link also varies so that the total bandwidth
between routers stays equal for a fair comparison. Therefore, Q8 has half the link bandwidth
of Q4, which has half the link bandwidth of Q2. Figure 4.10 shows the saturation throughput
for the various values of N as the link failure percentage increases.
Figure 4.10: Saturation throughput (flits/cycle/core) for varying percentage of link failures
(0% to 50%) for traffic mixes 1 to 4 for QORE of N values: 2 links (Q2), 4 links (Q4), and
8 links (Q8) [3] ©2015 IEEE.
Q2 always has the lowest throughput due to the small number of links between routers. Even
though Q2 has the highest link bandwidth, the reversibility of the links is very limited.
Additionally, the backup ring network is highly utilized in Q2. At a 50% link error rate the
probability of both links between any two routers failing is 25% (0.5^2). On the other hand,
at a 50% link error rate Q8 has only about a 0.4% (0.5^8) chance of all eight links between
any two routers failing. This means that Q8 will have to use the backup ring network much
less often. Therefore, at higher fault percentages (approximately greater than 20%) Q8 has
the highest throughput due to link availability and the rare use of the backup ring. However,
since each link in Q8 has half the bandwidth of the links in Q4, the saturation throughput of
Q4 is higher at lower fault percentages (≤20%) [3].
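The link-availability argument above can be checked with a short calculation. The sketch below assumes, as the text implicitly does, that each link fails independently with the same probability; the function name is illustrative and not part of QORE.

```python
def all_links_failed_probability(link_fault_rate: float, num_links: int) -> float:
    """Probability that every parallel link between two routers has failed.

    Assumption: links fail independently, each with probability link_fault_rate.
    """
    return link_fault_rate ** num_links


if __name__ == "__main__":
    # Q2, Q4, and Q8 at a 50% link fault rate.
    for name, n in (("Q2", 2), ("Q4", 4), ("Q8", 8)):
        p = all_links_failed_probability(0.5, n)
        print(f"{name}: {p:.4%} chance that all {n} links have failed")
```

Under this assumption the chance of falling back to the backup ring drops geometrically with N, which is consistent with Q8 relying on the ring far less often than Q2.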
4.1.8 Accuracy of Decision Trees
In this subsection, we will show the results of the decision tree used to predict traffic
flow. We randomly selected four of the 19 real applications to be used for training (Black-
Scholes, Ferret, Bzip, and Radix) and use the remaining 15 applications for testing. Some
features have three values: In, Out, or Even. The value is Even if the difference between
incoming and outgoing packets is less than or equal to one packet. Other features have
two values: Low and High. The threshold to determine Low and High is the median of the
feature [3].
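The feature discretization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, "In" is assumed to mean more incoming than outgoing packets, and values exactly equal to the median are assumed to map to Low (the text does not specify either detail).

```python
from statistics import median


def direction_feature(incoming: int, outgoing: int) -> str:
    """Three-valued feature: Even if the counts differ by at most one packet."""
    diff = incoming - outgoing
    if abs(diff) <= 1:
        return "Even"
    # Assumption: "In" means incoming traffic dominates.
    return "In" if diff > 0 else "Out"


def threshold_feature(value: float, training_values: list) -> str:
    """Two-valued feature: the threshold is the median of the feature."""
    # Assumption: ties with the median are labeled Low.
    return "High" if value > median(training_values) else "Low"


if __name__ == "__main__":
    print(direction_feature(7, 3))             # incoming dominates
    print(threshold_feature(10, [1, 5, 9]))    # above the median of 5
```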
The ID3 algorithm was used to build the trees and the results for the +x links of router
0 are shown in Figure 4.11(a). The root of this tree is the feature which represents the link
utilization on the +x links of router 0 during the previous Rw cycles. This feature alone is
the logic used in our baseline LC shown in Algorithm 1. Depending on the link utilization,
the tree is further expanded to check other features. For example, another good predictor
is the R1 +x link difference because many of the same packets use both the R1 +x links
and the R0 +x links.
Figure 4.11: (a) Decision tree for the +x links of router 0 (leaf labels: W = West, E = East,
B = Both) and (b) decision tree for the +y links of router 1 (leaf labels: N = North,
S = South, B = Both). Difference features branch on In, Out, or Even; buffer features
branch on Low or High [3] ©2015 IEEE.
Other links in the network may have trees with different features. For
example, the decision tree for the +y links of router 1 is shown in Figure 4.11(b). The root
of this tree is different and represents the R5 +y link difference. The R5 links are directly
north of R1 and much of the traffic on the R5 +y links is either coming from or going to
the R1 +y links. Buffer utilization is another important feature in both trees because it is
likely that packets waiting in buffers will use the links of interest in the future. If all of the
leaves of a node have the same label then that feature is pruned from the tree and replaced
with that label. This makes the tree more efficient at test time since only two comparisons
are needed to find the label for that branch [3].
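For readers unfamiliar with ID3, the sketch below shows the textbook procedure: pick the feature with the highest information gain, split on its values, and recurse. It is a generic illustration, not the implementation used in this work; the feature names and the six toy samples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the samples on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, features):
    """Return a nested dict tree; leaves are plain label strings."""
    if len(set(labels)) == 1:
        return labels[0]
    if not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    node = {"feature": best, "children": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


if __name__ == "__main__":
    # Hypothetical toy samples: the link-difference feature fully decides the label.
    rows = [
        {"r0_x_link_diff": "In", "core_buf": "Low"},
        {"r0_x_link_diff": "In", "core_buf": "High"},
        {"r0_x_link_diff": "Out", "core_buf": "Low"},
        {"r0_x_link_diff": "Out", "core_buf": "High"},
        {"r0_x_link_diff": "Even", "core_buf": "Low"},
        {"r0_x_link_diff": "Even", "core_buf": "High"},
    ]
    labels = ["W", "W", "E", "E", "B", "B"]
    print(id3(rows, labels, ["core_buf", "r0_x_link_diff"]))
```

On the toy data the uninformative buffer feature has zero gain, so the link-difference feature becomes the root and every branch ends in a leaf.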
Table 4.5 shows the accuracy of the decision trees for all of the links connected to router
0 and router 1. We compare the decision tree (DT) to a uniform random labeling and to
one feature - the root of each tree. We can accurately predict the outcome better than a
random labeling in every case. On testing data, we can outperform a one feature labeling
for the x links and perform equal to or worse for the y links. The y links are more difficult
to predict due to the XY routing used in the network. In XY routing, packets use the x links
first then the y links. The y links have more possibilities as to which links/buffers were
previously used because packets on the y links could have previously been on other y links
or other x links. On the other hand, packets on the x links could have only come from other
x links. Overall, on testing data, our results show a 10.35 percentage point improvement
over a random labeling and a 1.25 percentage point improvement over one feature labeling,
averaged over the four links. The accuracy on training data is also shown
in Table 4.5. In certain cases, such as the R0 +y link, the tree can be trimmed down to
one feature to reduce overfitting and improve performance [3].
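The two overall improvements quoted above follow from averaging the testing-data columns of Table 4.5 over the four links. The sketch below reproduces that arithmetic; the variable names are illustrative.

```python
# Testing-data accuracy (%) from Table 4.5, in link order: R0 +x, R0 +y, R1 +x, R1 +y.
TESTING_ACCURACY = {
    "Random": [33.4, 33.2, 34.0, 35.1],
    "One Feature": [40.1, 44.7, 44.9, 42.4],
    "DT": [46.6, 40.9, 47.2, 42.4],
}


def mean(values):
    return sum(values) / len(values)


dt_over_random = mean(TESTING_ACCURACY["DT"]) - mean(TESTING_ACCURACY["Random"])
dt_over_one_feature = mean(TESTING_ACCURACY["DT"]) - mean(TESTING_ACCURACY["One Feature"])

if __name__ == "__main__":
    print(f"DT vs. random:      {dt_over_random:.2f} percentage points")
    print(f"DT vs. one feature: {dt_over_one_feature:.2f} percentage points")
```

Both differences are absolute (percentage-point) gaps between the averaged accuracies, not relative improvements.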
Table 4.5: Accuracy of four different decision trees [3] ©2015 IEEE.
               R0 +x Link          R0 +y Link          R1 +x Link          R1 +y Link
               Training  Testing   Training  Testing   Training  Testing   Training  Testing
Random         30.3%     33.4%     31.9%     33.2%     31.6%     34.0%     32.1%     35.1%
One Feature    47.8%     40.1%     44.2%     44.7%     54.0%     44.9%     45.7%     42.4%
DT             56.7%     46.6%     52.4%     40.9%     61.1%     47.2%     53.4%     42.4%
4.2 Soft Error Prediction with Mitigation Results
In this section, we first evaluate the overhead of the error mitigation and the predictor
module. To evaluate energy, area, and timing, each module was synthesized using the
Synopsys Design Compiler with the 40 nm TSMC technology library, a nominal supply
voltage of 1.0 V and an operating frequency of 2 GHz. The energy and area of the links
and other router components were evaluated with the DSENT NoC modeling tool [86]
using the 45 nm technology library, a voltage of 1.0 V, and a frequency of 2 GHz.
Next, we discuss the results of the trained decision trees. Training is done offline; thus,
there is no runtime latency for training. For each link in the network, we train a separate
decision tree using separate training data sets. The decision trees are trained using the ID3
algorithm and each training data set initially has 50,000 samples. Since the data is skewed,
samples are replicated, as explained in Section 3.1.2, to bring the total size of the training
data to 150,000 samples. Then we test each decision tree using separate testing data sets
for each link. Each testing data set has 10,200 samples. We compare our prediction design
which labels samples using decision trees (DTs) to several other labeling techniques: one-
feature, weighted random (WRand), uniform random (Rand), N-always, F-always, and M-
always labeling. A one-feature labeling is a modification of our DTs which uses only the
root node in the trained decision trees to label test samples. WRand labeling will randomly
assign a label to samples based on the label distribution of the skewed training data. Rand
labeling will uniformly assign a random label to test samples. N-always will always
assign the N label to a sample which implies that this design uses only CRC. F-always
is equivalent to a design in which SECDED and CRC are always employed. M-always will
always enable relaxed transmission and CRC. The three techniques which require training
data (DTs, one-feature, and WRand) will be categorized as intelligent labeling techniques
and the remaining four techniques which do not require training data (Rand, N-always,
F-always, and M-always) will be categorized as non-intelligent labeling techniques.
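The simpler baseline labelers can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and WRand is assumed to sample labels in proportion to their counts in the training data, which is one reading of "based on the label distribution".

```python
import random

LABELS = ("N", "F", "M")


def always_label(label, num_samples):
    """N-always, F-always, or M-always: every sample receives the same label."""
    return [label] * num_samples


def rand_label(num_samples, rng):
    """Rand: each label is equally likely."""
    return [rng.choice(LABELS) for _ in range(num_samples)]


def wrand_label(num_samples, training_labels, rng):
    """WRand (assumed form): sample labels in proportion to training-set counts."""
    weights = [training_labels.count(label) for label in LABELS]
    return rng.choices(LABELS, weights=weights, k=num_samples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    skewed_training = ["N"] * 82 + ["F"] * 10 + ["M"] * 8  # hypothetical skew
    print(always_label("N", 5))
    print(rand_label(5, rng))
    print(wrand_label(5, skewed_training, rng))
```

Only `wrand_label` needs training data, which matches the split between intelligent and non-intelligent techniques.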
Finally, we evaluate our design on a 64-core, concentrated mesh (CMesh) network
as shown in Figure 4.12. The network parameters are also shown in Figure 4.12. We
use a packet size of 4 flits and each flit is 64 bits. We use ACK/NACK signals at the
destination to signal successful/unsuccessful packet transmission and ACKs/NACKs per
hop to signal successful/unsuccessful flit transmission [87]. Retransmission buffers are
used at the routers which store the data until an ACK is received. If a NACK is received
then the data is retransmitted. We assume that the ACKs/NACKs use control lines which
are separate from the data lines. We inject timing errors on the data links every time a flit
is transmitted. Additionally, to account for all other errors which are not timing related,
we also inject data corruption faults which cause bits of data to be unexpectedly changed.
Therefore, data corruption faults, unlike timing faults, are not due to data arriving late.
We inject errors individually on each data link with a certain probability, based on our
model described in Section 3.1.1, which incorporates link utilization, temperature, process
variation, and wear-out. For fair evaluation, the training and testing data follow the same
probability distribution and the test data is independent of the training data, as is standard
practice for machine learning algorithms.
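The per-hop retransmission buffer described above can be sketched as a small queue. This is a simplified illustration, not the router implementation: the class and method names are hypothetical, and ACKs/NACKs are assumed to arrive in the order the flits were sent.

```python
from collections import deque


class RetransmissionBuffer:
    """Holds each sent flit until the downstream router acknowledges it."""

    def __init__(self):
        self._pending = deque()

    def send(self, flit):
        """Store a copy of the flit and hand it to the link."""
        self._pending.append(flit)
        return flit

    def on_ack(self):
        """ACK for the oldest outstanding flit: its copy can be dropped."""
        return self._pending.popleft()

    def on_nack(self):
        """NACK for the oldest outstanding flit: return it for retransmission."""
        return self._pending[0]

    def pending_count(self):
        return len(self._pending)


if __name__ == "__main__":
    buf = RetransmissionBuffer()
    buf.send("flit0")
    buf.send("flit1")
    print("retransmit:", buf.on_nack())  # flit0 failed, so it is sent again
    print("released:", buf.on_ack())     # flit0 finally acknowledged
    print("still pending:", buf.pending_count())
```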
We execute real traffic traces on our network using workloads from the Splash-2,
PARSEC, and SPEC CPU2006 benchmark suites. Traces were collected using the full
execution-driven simulator SIMICS from Wind River [83] with the memory package
GEMS [84]. We assume a 2 cycle delay to access the L1 cache, a 4 cycle delay for the
L2 cache, and a 160 cycle delay to access main memory. For each simulation we fix the
number of packets sent (not including retransmissions) to 32,000 packets.
Parameter          Value
Cores              64
Concentration      4 cores
Routers            16
Data Link Width    64 bits
Number of Links    48
Packet Size        4 flits
Flit Size          64 bits
VCs                4 per port
VC Size            4 flits
Figure 4.12: Concentrated mesh with network parameters [1]. Routers are numbered 0 to 15
in a 4×4 grid (0-3 in the bottom row, 12-15 in the top row) with +x and +y axes marked.
Table 4.6: Overhead of router components [1].
                   Energy (pJ)    Area (µm2)      Timing (ns)
64b Link           10.333         135.2           0.515
Buffer (1 flit)    1.154          1,017.3         0.07
Switch             1.572/flit     15,402 (8x8)    0.05
4.2.1 Overhead of Error Mitigation and Predictors
The overheads of the network components are shown in Table 4.6. The energy of a 64
bit, 4 mm long link was found to be approximately 10.333 pJ. In order to simulate a design
with aggressive clocking, the delay of the link (0.515 ns) is designed to be 3% higher than
the clock period (0.50 ns). Since the clock period is less than the average arrival time of the
data, the link is susceptible to timing errors which we can mitigate with SECDED and RT.
Table 4.7: Error Mitigation and Predictor Overheads [1].
Module             Energy (pJ)    Area (µm2)    Timing (ns)
CRC Encoder        0.620          2,680.0       1.09
CRC Decoder        0.620          2,680.0       1.09
CRC Total          1.240          5,360.0       2.18
SECDED Encoder     0.072          285.6         0.24
SECDED Decoder     0.227          930.5         0.38
SECDED Parity      1.292          -             -
SECDED Total       1.591          1,216.1       0.62
Predictor Total    0.002          29.8          0.17
Table 4.7 shows the energy, area, and timing overheads of the CRC, SECDED, and
predictor modules. The CRC encoder and decoder each consume 0.620 pJ.
CRC uses several shift registers to obtain a check code. The value of this check code
determines if there are errors or not. The process is the same at both the encoder and
decoder; therefore, the energy, area, and timing are the same for both modules. The timing
of the encoder/decoder is 1.09 ns. Since our clock period is 0.50 ns, encoding will take
three cycles and decoding will also take three cycles. The energy, area, and latency of CRC
are the largest of our error mitigation techniques, which is the reason why it is reserved only
for source and destination routers.
For SECDED, the encoder energy is approximately 0.072 pJ. The decoder energy is
the same if there are no errors, but higher if errors are corrected or detected. Since our
SECDED uses a (72,64) Hamming code, there is also an energy overhead for transmitting
the additional 8 parity bits across the network links. The energy for parity bit transmission
is approximately 1.292 pJ. The latency for SECDED encoding is only 0.24 ns and can be
combined with the output buffering router stage. Using the Synopsys Design Compiler, the
latency of the buffer stage is only 0.07 ns. Therefore, buffering and SECDED encoding will
take a total of 0.31 ns which is well below our clock period of 0.50 ns. Similarly, SECDED
decoding can be combined with the buffer write stage at the input port.
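The cycle counts and energies quoted in this subsection can be reproduced from the values in Tables 4.6 and 4.7. The sketch below is a worked check, not part of the design flow; in particular, scaling the 64-bit link energy by 8/64 to obtain the parity energy is an assumption that happens to match the reported 1.292 pJ.

```python
import math

CLOCK_PERIOD_NS = 0.50   # 2 GHz clock
LINK_DELAY_NS = 0.515    # 64-bit, 4 mm link (Table 4.6)
LINK_ENERGY_PJ = 10.333  # 64-bit link (Table 4.6)


def cycles_needed(latency_ns, clock_period_ns=CLOCK_PERIOD_NS):
    """Number of whole clock cycles a stage with the given latency occupies."""
    return math.ceil(latency_ns / clock_period_ns)


# Aggressive clocking: the link delay exceeds the clock period by 3%.
link_overshoot = LINK_DELAY_NS / CLOCK_PERIOD_NS - 1.0

# CRC encode (or decode) at 1.09 ns needs three cycles.
crc_cycles = cycles_needed(1.09)

# SECDED encoding (0.24 ns) merged with output buffering (0.07 ns) fits in one cycle.
secded_buffer_stage_ns = 0.24 + 0.07

# Assumption: 8 parity wires cost 8/64 of the 64-bit link energy.
parity_energy_pj = LINK_ENERGY_PJ * 8 / 64

if __name__ == "__main__":
    print(f"link overshoot: {link_overshoot:.1%}")
    print(f"CRC cycles: {crc_cycles}")
    print(f"SECDED + buffer stage: {secded_buffer_stage_ns:.2f} ns")
    print(f"parity energy: {parity_energy_pj:.3f} pJ")
```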
Finally, our predictor module consumes a very low energy of 0.002 pJ and occupies
only 29.8 µm2 of area. The predictor module implements the decision tree which uses,
at most, two comparisons for each level in the tree; therefore, the overhead is very small.
The latency of the predictor module is only 0.17 ns which is under the clock period, so
predictions can easily be made every cycle.
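A software analogue of the predictor shows why at most two comparisons are needed per tree level: two thresholds split a raw value into low, med, or high, which selects the branch. The sketch below is illustrative only; the tree, the feature names, and all threshold values are hypothetical and are not the trained trees of this work.

```python
def quantize(value, low_threshold, high_threshold):
    """Map a raw value to low/med/high using at most two comparisons."""
    if value < low_threshold:
        return "low"
    if value < high_threshold:
        return "med"
    return "high"


def predict(tree, sample, thresholds):
    """Walk the tree from the root to a leaf label (N, F, or M)."""
    node = tree
    while isinstance(node, dict):
        feature = node["feature"]
        level = quantize(sample[feature], *thresholds[feature])
        node = node["children"][level]
    return node


# Hypothetical two-level tree and thresholds, for illustration only.
EXAMPLE_TREE = {
    "feature": "temperature",
    "children": {
        "low": "N",
        "med": {
            "feature": "wear_out",
            "children": {"low": "N", "med": "N", "high": "F"},
        },
        "high": "M",
    },
}
EXAMPLE_THRESHOLDS = {"temperature": (60.0, 80.0), "wear_out": (0.3, 0.7)}

if __name__ == "__main__":
    sample = {"temperature": 70.0, "wear_out": 0.9}
    print(predict(EXAMPLE_TREE, sample, EXAMPLE_THRESHOLDS))
```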
4.2.2 Trained Decision Trees
Using the training data sets and the ID3 algorithm we trained a separate decision tree for
each link in the CMesh topology [85]. While we chose CMesh, our approach is generic and
can be applied to any topology. Figure 4.13 and Figure 4.14 show the resulting decision
trees for a few different links. The full, three level decision tree for the +x link of router 0
is shown in Figure 4.13(a). If all the leaves of a certain node have the same label then the
tree can be pruned by replacing that node with the label. For example, the whole middle
branch in Figure 4.13(a) can be replaced by the N label. Figure 4.13(b) shows the result of
the pruned tree for the +x link of router 0. Pruning makes the trees more efficient as there
will be fewer comparisons, which will reduce latency and energy.
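The pruning rule can be written as a short recursive function. This is a minimal sketch under the assumption that a tree is stored as nested dictionaries with label strings as leaves; the example tree is hypothetical and only mimics the situation where a whole middle branch collapses to N.

```python
def prune(node):
    """Collapse any subtree whose leaves all carry the same label into that label."""
    if not isinstance(node, dict):  # a leaf is just a label string
        return node
    children = {value: prune(child) for value, child in node["children"].items()}
    child_values = list(children.values())
    all_leaves = all(not isinstance(child, dict) for child in child_values)
    if all_leaves and len(set(child_values)) == 1:
        return child_values[0]
    return {"feature": node["feature"], "children": children}


# Hypothetical full tree: every leaf under the "med" branch is N.
FULL_TREE = {
    "feature": "temperature",
    "children": {
        "low": "N",
        "med": {
            "feature": "prev_error",
            "children": {"N": "N", "F": "N", "M": "N"},
        },
        "high": {
            "feature": "wear_out",
            "children": {"low": "N", "med": "F", "high": "F"},
        },
    },
}

if __name__ == "__main__":
    print(prune(FULL_TREE))
```

Because children are pruned before their parent is examined, a chain of uniform subtrees collapses bottom-up in a single pass.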
Temperature is the root node for each tree in Figure 4.13 and Figure 4.14 and was found
to be the root node for every link in the topology. As temperature has such a strong effect
on probability of timing errors, this is expected. At low temperatures, the probability of
error is low. Therefore, 45 of the 48 trees (93.8%) have the N label as the leaf of the
low temperature branch; the other three trees have at least one F label in the subtree of
the low temperature branch. No tree has an M label anywhere in the low temperature branch.
At the second level, 46 of the decision trees had the Wear-out feature or a label at each
node. The remaining two trees had a combination of Utilization and Wear-out at the second
level. The Wear-out feature was so common because it is another strong predictor of timing
errors. At the third level of the trees, the nodes vary mostly between the Utilization and
Previous Error features with only two trees using the Wear-out feature.
Figure 4.13: Decision trees built using the ID3 algorithm for +x link of router 0: (a) the
full tree and (b) the pruned tree. Temperature is the root with low, med, and high branches;
leaves are labeled N, F, or M [1].
Figure 4.14: Decision trees built using the ID3 algorithm for (a) +y link of router 0 and
(b) +x link of router 1 [1].
4.2.3 Performance of Decision Trees
Next, we will examine the confusion matrix for our decision trees. A confusion matrix
is a table that summarizes the performance of a prediction algorithm. Each column
in the matrix is a predicted label and each row is the true label. The confusion matrix shows
how often a predictor confuses labels. The confusion matrix for the +x link of router 0 is
shown in Table 4.8. The diagonal elements of this matrix show the number of test samples
that were correctly labeled. For example, 7,508 out of the 10,200 test samples were correctly
labeled N. The non-diagonal elements of the matrix show the number of test samples that were
mislabeled. For example, 2,552 samples were labeled F when the true label was actually
N.
Table 4.8: Confusion matrix for +x link of router 0 [1].
               Predicted
               N        F        M
True    N      7508     2552     0
        F      22       118      0
        M      0        0        0
The average confusion matrix for all the decision trees is shown in Table 4.9. When
the sample is correctly labeled, there are no additional overheads. However, mislabeled
samples will cause either additional latency or energy overheads. Table 3.2 and Table 3.3
from Section 3.3 detail the overheads for each possibility in the confusion matrix. For
example, if a sample is labeled F when the true label was actually N then the flit will be
encoded/decoded with SECDED. However, since there were actually no errors, there will
be a slight energy overhead due to the unnecessary SECDED encoding/decoding but no
latency overhead. The exact energy of SECDED and CRC encoding and decoding was
shown in Table 4.7 and the effects of the mislabeled samples on network performance will
be shown in Section 4.2.4.
Table 4.9: Confusion matrix averaged over all decision trees [1].
               Predicted
               N          F          M
True    N      5,981.8    1,902.8    483.1
        F      121.1      565.0      334.1
        M      0.8        159.8      651.5
From the average confusion matrix, we calculate the accuracy of our decision trees
(DTs) compared to the other labeling techniques and the results are shown in the first
column of Table 4.10. The DTs correctly predict the true label in approximately 70.6%
of the samples which improves over a random labeling. Two labeling techniques have
higher accuracies than DTs with N-always the highest at 82.0% and WRand next at 75.5%.
These labeling techniques have such high accuracy because a large percentage of the
samples (82.0%) have the label N. Therefore, a labeling technique such
as N-always, which labels all samples N, will correctly predict exactly 82.0% of
the samples. Such a data set is often called skewed; therefore, a common and better metric
to consider is the F-score, given by the following equation:
\[ F_{\text{score}} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{4.1} \]
The F-score is a different measure of accuracy used for skewed data and is the harmonic
mean of the precision and recall, which are measures of relevance. Table 4.10 shows
the average F-score for each labeling technique. The DTs have the highest F-score at
61.0% followed by the one-feature labeling which is at 52.1%. The one-feature labeling is
equivalent to using only temperature to predict since every tree had the Temperature feature
as the root. The N-always labeling which had a high accuracy has an F-score of only 30.0%.
On average, the F-score of DTs is 32.4 percentage points higher than that of the other
labeling techniques.
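As a worked check, the accuracy and F-score of the DTs can be recomputed from the averaged confusion matrix in Table 4.9. The sketch below assumes that precision and recall are each averaged over the three labels (macro-averaging) before Equation 4.1 is applied; the text does not state the averaging scheme, but under this assumption the calculation lands on the reported 70.6% and 61.0%.

```python
# Table 4.9: rows are true labels, columns are predicted labels, order N, F, M.
CONFUSION = [
    [5981.8, 1902.8, 483.1],
    [121.1, 565.0, 334.1],
    [0.8, 159.8, 651.5],
]


def accuracy(cm):
    """Fraction of samples on the diagonal (correctly labeled)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def macro_precision_recall(cm):
    """Per-label precision and recall, each averaged over the labels."""
    n = len(cm)
    precisions, recalls = [], []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        true_k = sum(cm[k])                            # row sum
        precisions.append(cm[k][k] / predicted_k if predicted_k else 0.0)
        recalls.append(cm[k][k] / true_k if true_k else 0.0)
    return sum(precisions) / n, sum(recalls) / n


def f_score(precision, recall):
    """Equation 4.1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    p, r = macro_precision_recall(CONFUSION)
    print(f"accuracy: {accuracy(CONFUSION):.1%}")
    print(f"F-score:  {f_score(p, r):.1%}")
```

A label that is never predicted contributes a precision of zero here, which is why a constant labeler such as N-always scores poorly despite its high accuracy.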
Table 4.10: Accuracy, F-score, and per-hop retransmit percent due to timing errors for our
decision trees (DTs) compared to other labeling techniques [1].
                Accuracy    F-score    Retransmit
DTs             70.6%       61.0%      2.8%
One-Feature     64.6%       52.1%      1.8%
WRand           75.5%       45.3%      12.1%
Rand            33.3%       33.3%      8.6%
N-always        82.0%       30.0%      18.0%
F-always        10.0%       6.1%       8.0%
M-always        8.0%        4.9%       0.0%
Some mispredictions cause a small one cycle delay or a small encoding overhead.
However, there are three cases in the confusion matrix in which mislabeling in our design
will cause a full retransmission. These cases are in the lower left triangular part of the
confusion matrix: 1) Predicted N/True F, 2) Predicted N/True M, and 3) Predicted F/True
M. As these mispredictions are the most costly, the total number of these three cases should
be minimized. The third column of Table 4.10 shows the percentage of samples which will
require retransmissions due to a timing error. In our design, on average, only 2.8% of the
samples cause a full retransmission. DTs are able to improve slightly over a one-feature
labeling because DTs have more nodes. Except for M-always, the DTs' percentage of
retransmits is considerably lower than that of the non-intelligent labeling techniques, which
must retransmit on at least 8.0% of the samples on average. The M-always labeling will never
have to retransmit due to timing errors; however, a retransmission can be caused by a data
corruption error. Additionally, every flit in M-always must be delayed two cycles for every
link it uses.
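The retransmit column can be recomputed as the share of samples in the lower left triangle of a confusion matrix, that is, samples whose predicted protection is weaker than the true label requires. The sketch below uses the averaged matrix of Table 4.9 for the DTs and derives the constant labelers from its row totals; the helper names are illustrative.

```python
# Table 4.9: rows are true labels, columns are predicted labels, order N, F, M.
DT_CONFUSION = [
    [5981.8, 1902.8, 483.1],
    [121.1, 565.0, 334.1],
    [0.8, 159.8, 651.5],
]


def retransmit_fraction(cm):
    """Share of samples below the diagonal (predicted N/true F, N/M, F/M)."""
    total = sum(sum(row) for row in cm)
    under_protected = sum(cm[i][j] for i in range(len(cm)) for j in range(i))
    return under_protected / total


def always_confusion(cm, predicted_index):
    """Confusion matrix of a labeler that always predicts one label."""
    n = len(cm)
    return [
        [sum(cm[i]) if j == predicted_index else 0.0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    print(f"DTs:      {retransmit_fraction(DT_CONFUSION):.1%}")
    for index, name in enumerate(("N-always", "F-always", "M-always")):
        cm = always_confusion(DT_CONFUSION, index)
        print(f"{name}: {retransmit_fraction(cm):.1%}")
```

The results agree with the 2.8%, 18.0%, 8.0%, and 0.0% entries of Table 4.10 to within rounding.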
4.2.4 Network Performance
After evaluating the performance of the decision trees on timing errors, we now
introduce data corruption errors and evaluate the effect of all soft errors on network
performance. First, in Figure 4.15, we show the percent of retransmitted packets in the five
best designs in terms of per hop retransmission rates as evaluated in the previous section.
The percent of retransmitted packets for each design is relatively high due to the aggressive
clocking, the multi-hop communication, and the high operating temperatures we assume.
Figure 4.15: Percent of packets that require full retransmission [1]. The plot compares the
DT, One-Feature, F-always, M-always, and Rand designs for Barnes, FMM, FFT, LU, Ocean,
Radix, Raytrace, Water, bzip, gcc_base, hmmer, Facesim, Ferret, Fluid, Freqmine, and
Swaptions, plus the average.
However, DTs can reduce the number of retransmissions by 23.3% on average over the
other designs. The one-feature design, which is a modification of our DTs, has slightly
more retransmissions due to more mispredictions. F-always can prevent retransmission
due to a few timing errors or a few data corruption errors; however, with many errors, a
full retransmit is still required. The M-always design will not have to retransmit for timing
errors but will retransmit for data corruption errors. Finally, the Rand labeling incurs the
most retransmissions due to the high number of mispredictions.
In Figure 4.16, we evaluate application speedup relative to M-always, which is equal
to the ratio of the execution time of M-always to the execution time of a given design.
Averaged over all applications, DTs execute faster than all other designs with a speedup
of approximately 3.47×. The ability of DTs to accurately predict timing errors reduces
latency delays such as one-hop retransmissions, full retransmissions, and unnecessary RT
activations. Also, the ability of SECDED to detect and correct data corruption errors
reduces retransmissions, thereby reducing the execution times. The speedup of one-feature
labeling is less than that of DTs due to fewer features in the decision trees. The F-always
design always applies SECDED so it can avoid some one-hop flit retransmissions; however,
many errors will cause a full retransmission. The Rand design improves execution time over
M-always by only 1.12× on average due to the large number of mispredictions. Finally, in
the M-always design, each flit incurs a two cycle penalty on every link traversal due to RT
being employed at all times.
Figure 4.16: Application speedup relative to M-always [1]. The plot compares the DT,
One-Feature, F-always, Rand, and M-always designs for Barnes, FMM, FFT, LU, Ocean, Radix,
Raytrace, Water, bzip, gcc_base, hmmer, Facesim, Ferret, Fluid, Freqmine, and Swaptions,
plus the average.
Finally, we examine the component breakdown for the energy per flit in Figure 4.17.
Due to space constraints, we only show nine of the applications. During simulation, we
calculate the total energy consumption for all the flits then we divide by the number of
error-free flits which is fixed to 128,000 flits, or 32,000 packets for all simulations.
Therefore, designs with a higher number of retransmissions will require more energy to
successfully transmit 128,000 error-free flits and will have higher energy per flit values.
Our DT design reduces energy per flit by approximately 41.9% on average over all other
designs. The main reason for this reduction is the smaller number of one-hop and full
retransmissions due to correct predictions. Additionally, DTs can reduce the SECDED energy
over designs such as F-always and Rand when predictions are correct. The M-always design
has no SECDED energy and RT has no additional energy overheads; however, the energy per flit
is still higher due to data corruption errors which cause full retransmissions.
Figure 4.17: Breakdown of energy per flit (pJ) into CRC, SECDED, link, crossbar, and buffer
components for the DT, One-Feature, F-always, M-always, and Rand designs on Barnes, Ocean,
Water, bzip, gcc_base, hmmer, Facesim, Fluid, and Swaptions, plus the average [1].
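The energy-per-flit normalization described above can be sketched as follows. Only the packet count and packet size come from the text; the function name and the component totals in the example are hypothetical placeholders, not measured results.

```python
PACKETS_PER_SIMULATION = 32_000
FLITS_PER_PACKET = 4
ERROR_FREE_FLITS = PACKETS_PER_SIMULATION * FLITS_PER_PACKET  # 128,000 flits


def energy_per_flit(component_energy_pj):
    """Divide each component's total energy by the fixed number of error-free flits.

    Retransmitted flits add to the totals but not to the divisor, so designs
    with more retransmissions report a higher energy per flit.
    """
    return {
        name: total / ERROR_FREE_FLITS
        for name, total in component_energy_pj.items()
    }


if __name__ == "__main__":
    # Hypothetical totals (pJ) for one run, for illustration only.
    totals = {"Link": 2_560_000.0, "Buffer": 640_000.0, "CRC": 320_000.0}
    for name, value in energy_per_flit(totals).items():
        print(f"{name}: {value:.2f} pJ per error-free flit")
```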
5 Conclusions and Future Work
Processor clock rates have seen a dramatic increase in the last 30 years as processors
changed to keep up with high performance demands. However, the increase in clock rate
has also led to an increase in power which cannot be sustained due to cooling limitations
and power budgets. Therefore, computer architects have transitioned into the multicore
era in which computer processors are built with tens to thousands of processing cores
called chip multiprocessors (CMPs). Each processing core operates at a low frequency
to lower power consumption; however, the multiple cores communicate with each other to
provide higher performance than a single processing core. In order for the many cores to
communicate, efficient communication infrastructures called networks-on-chip (NoCs) are
required. NoCs connect nodes via routers and links in topologies such as concentrated
mesh, Flattened Butterfly, and mesh. As transistor technology and the number of cores
continue to scale in the CMP, the three major concerns in NoCs are power, performance,
and reliability. High dynamic and leakage power from router buffers has led designs to
remove buffers from the router and move them to the channel in channel-buffer designs.
To improve performance, some designs have implemented reversibility to take advantage
of under-utilized links. Finally, shrinking transistors, wear-out, and device aging, among
other effects, have led to both hard and soft errors in the NoC. While many fault-tolerant
designs are reactive, there are few proactive designs which take measures to avoid or
prevent errors before they affect the system. Machine learning techniques, which make
predictions and improve with experience, can be a useful tool in proactive designs. Machine
learning has been implemented in some multicore applications, but has not yet been used
to predict errors for a fault-tolerant design. In this dissertation, we address the three major
concerns of NoCs in two separate but related architectures: 1) the QORE architecture which
handles permanent faults with power-efficient, high performance channel buffers and 2) a
comprehensive system for prediction and mitigation of soft errors.
In this dissertation, we propose QORE - a fault tolerant NoC architecture using
reversible channel buffers. We use QORE’s reversibility for increased performance and
to overcome faulty links. We also engineer features and use decision trees to predict
traffic direction on the links to improve the link controllers. Simulation results show that
a decision tree predicts the direction of the traffic with higher accuracy on average than
a predictor based on thresholded link utilization. Our results on real benchmarks (SPEC
CPU2006, PARSEC, and SPLASH-2) show a speedup of 1.3×, and our results on synthetic
traffic show a throughput improvement of 2.3×, compared to related work. Using the Synopsys
Design Compiler, we show that QORE reduces network power by 21% while requiring
minimal control overhead [2, 3].
Next, we develop a new approach to proactive fault-tolerance which uses machine
learning (ML) algorithms to predict and mitigate errors. We provide a comprehensive fault-
prediction system in which we (a) create a methodology to obtain realistic training/testing
data sets, (b) train an ML algorithm to predict timing faults on links, and (c) mitigate soft
errors. We develop a fault model, which accounts for parameter variation and device wear-out,
to create training/testing data sets for the ML algorithm. Using the training data set
and the ID3 algorithm, we create decision trees which can be used to accurately predict the
number of errors. Finally, we dynamically mitigate the errors using a combination of error
correction codes (ECC) and a relaxed transmission. Our results show that the energy (0.002
pJ), area (29.8 µm2), and latency (0.17 ns) overheads of the predictor implementation are
minimal. Decision trees are able to accurately predict timing errors 32.4% better than other
intelligent and non-intelligent labeling techniques. Our network results indicate a 23.3%
reduction in packet retransmissions, a 3.47× speedup, and an energy savings of 41.9% on
average over other designs.
For future work, the QORE architecture can be expanded to implement power gating.
As explained in the introduction, channel buffers consume a higher leakage power
compared to conventional register buffers. Since MFCs have added transistors, the leakage
power will be even greater compared to conventional channel buffers. While dynamic
power is still higher than leakage in current technology, as transistors continue to scale the
dynamic power is scaling and the leakage power will become a more significant percentage
of total power. Recently, power gating techniques have been studied to reduce the static
power in NoCs by shutting down whole routers or certain router components when not
being used [88]. Power gating techniques cut off components from Vdd so that there is no
leakage current through the transistors but the components become non-operational when
power gated. However, there is a latency associated with shutting down and waking up
components. Therefore, there is a power-performance trade-off involved with power gating
techniques.
Power gating can be implemented on the MFCs of the links in QORE. Since QORE has
multiple links between routers, power gating some of the links should have only a minimal
impact on performance. At low network loads, not all of the N router links are needed, so a
fraction of them can be power gated depending on the network load. Since link utilization is
already calculated and stored in QORE, this information can readily be used to determine
which links are least utilized. The latency of shutting down and waking up the MFCs should
have a minimal effect on packet latency because, with N links between routers, packets will
likely have an alternate minimal path to reach their destination. Power gating can be done
on a router-by-router basis for optimal power savings. Power gating the MFCs in QORE can
save leakage power because it gates both links and buffers at the same time while
maintaining network performance. Another direction for future work on QORE is to
implement the MFCs on topologies other than mesh, such as a torus network.
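One possible link-selection policy for the power gating described above is sketched
below in Python. The function name, the rule that the number of active links is the
network load times N rounded up, and the min_active parameter are all assumptions made
for illustration; this work does not define such a policy.

```python
import math


def select_links_to_gate(utilization, load, min_active=1):
    """Return the indices of the links to power gate, least utilized first.

    utilization: per-link utilization in [0, 1], as already tracked per router
    load:        estimated network load in [0, 1]
    min_active:  number of links that always stay powered (assumed parameter)
    """
    n = len(utilization)
    # Assumed rule: keep ceil(load * N) links on, but never fewer than min_active.
    needed = min(n, max(min_active, math.ceil(load * n)))
    by_utilization = sorted(range(n), key=lambda i: utilization[i])
    return sorted(by_utilization[: n - needed])


# Four links between a router pair at 50% load: the two least utilized are gated.
gated = select_links_to_gate([0.40, 0.05, 0.10, 0.30], load=0.5)
```

A real controller would also need hysteresis so that links are not gated and woken
repeatedly, since each transition incurs the shutdown and wake-up latency noted above.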
Another topic for future work is the design of age-aware and fault-aware routing
techniques. Typically, packets take a deterministic path from source to destination.
With an age-aware and fault-aware routing algorithm, a path can be chosen that reduces
wear-out while also avoiding faults. Since link statistics are already collected in the LST of
the router, this information can also be used to estimate the age of links. Every
fault-tolerant routing algorithm has the same requirements in the face of faults: (i) isolate
the faulty channel, (ii) reconfigure the routing algorithm to bypass the faulty channel,
(iii) indicate the new routing paths to all nodes, and (iv) prevent deadlocks by prohibiting
turns that form a cycle within the routing algorithm. Each of these requirements could be
addressed in future work with various techniques such as built-in self-tests for detection of
faults, efficient global communication for reconfiguration, and virtual channel allocation to
prevent deadlocks.
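The path-selection idea behind age-aware and fault-aware routing can be sketched as
follows, assuming a 2D mesh with (x, y) node coordinates, directed links written as
(node, node) pairs, and a per-link age score in which larger means more worn. These
names and the additive cost are hypothetical. The sketch covers only the bypass of faulty
channels and does not address notifying other nodes or deadlock freedom.

```python
from itertools import permutations


def minimal_paths(src, dst):
    """Enumerate all minimal paths in a 2D mesh as tuples of directed links."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    x_step = (1 if dx > 0 else -1, 0)
    y_step = (0, 1 if dy > 0 else -1)
    steps = [x_step] * abs(dx) + [y_step] * abs(dy)
    paths = set()
    for order in set(permutations(steps)):
        node, links = src, []
        for sx, sy in order:
            nxt = (node[0] + sx, node[1] + sy)
            links.append((node, nxt))
            node = nxt
        paths.add(tuple(links))
    return paths


def choose_path(src, dst, faulty, age):
    """Pick the fault-free minimal path with the lowest total link age."""
    candidates = [
        p for p in minimal_paths(src, dst)
        if not any(link in faulty for link in p)
    ]
    if not candidates:
        return None  # no fault-free minimal path; a non-minimal detour is needed
    # The path itself is the tie-breaker so that the result is deterministic.
    return min(candidates, key=lambda p: (sum(age.get(l, 0.0) for l in p), p))


# The X-first path crosses an aged link, so the Y-first path is preferred.
aged = {((0, 0), (1, 0)): 0.9}
path = choose_path((0, 0), (1, 1), faulty=set(), age=aged)
```

Enumerating permutations is practical only for short paths; it is used here to keep the
example small, not as a proposal for a hardware implementation.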
Future work that extends our prediction and mitigation architecture could
evaluate the effects of using different machine learning algorithms such as ANNs. ANNs
could be implemented with offline learning like DTs or could use online learning so that
the algorithms can change during runtime. Also, a cost-sensitive approach can be taken
when training decision trees in order to reduce the number of costly mispredictions. This
can be done by weighting certain labels during training. Additionally, the fault model
for creating the data sets can use alternate models and more input parameters in order to
more accurately represent a real system. Lastly, simulations can be performed to evaluate
the effect of different error probabilities on the performance of the predictor and the
network.
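The label-weighting idea for cost-sensitive training can be sketched as below, where
each label contributes its weight rather than a count of one to the class totals used for
entropy and for the leaf prediction. The label names and weight values are hypothetical;
how the weights would be chosen is left open here.

```python
import math
from collections import defaultdict


def weighted_counts(labels, weights):
    """Sum a per-label weight (default 1.0) instead of counting each sample once."""
    totals = defaultdict(float)
    for label in labels:
        totals[label] += weights.get(label, 1.0)
    return dict(totals)


def weighted_entropy(labels, weights):
    """Entropy (in bits) computed over weighted class mass."""
    totals = weighted_counts(labels, weights)
    mass = sum(totals.values())
    return -sum((t / mass) * math.log2(t / mass) for t in totals.values())


def leaf_label(labels, weights):
    """Predict the class with the largest weighted mass at a leaf."""
    totals = weighted_counts(labels, weights)
    return max(sorted(totals), key=lambda k: totals[k])


# Three error-free samples and one error sample reach the same leaf.
labels = ["no_error", "no_error", "no_error", "error"]
unweighted = leaf_label(labels, {})              # majority vote
weighted = leaf_label(labels, {"error": 4.0})    # missing an error costs more
```

With the weight above, the leaf predicts an error even though errors are the minority
class, trading extra false alarms for fewer missed errors.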
References
[1] D. DiTomaso, T. Boraten, and A. Kodi, “Prediction and mitigation of soft errors in
NoCs with machine learning,” submitted to IEEE/ACM International Symposium
on Microarchitecture, 2015.
[2] D. DiTomaso, A. Kodi, and A. Louri, “QORE: A fault tolerant network-on-chip
architecture with power-efficient quad-function channel (QFC) buffers,” in The
20th IEEE International Symposium on High Performance Computer Architecture,
February 2014.
[3] D. DiTomaso, A. Kodi, A. Louri, and R. Bunescu, “Resilient and power-efficient
multi-function channel buffers in network-on-chip architectures,” IEEE Transactions
on Computers, vol. PP, no. 99, pp. 1–1, February 2015.
[4] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The
Hardware/Software Interface, 5th ed. Morgan Kaufmann, 2013.
[5] H. Kim, A. Vitkovskiy, P. V. Gratz, and V. Soteriou, “Use it or lose it: Wear-out and
lifetime in future chip multiprocessors,” in Proceedings of the 46th Annual IEEE/ACM
International Symposium on Microarchitecture, 2013.
[6] A. DeOrio, L.-S. Peh, and V. Bertacco, “ARIADNE: Agnostic reconfiguration
in a disconnected network environment,” in International Conference on Parallel
Architectures and Compilation Techniques (PACT), 2011, pp. 298–309.
[7] D. Fick, A. DeOrio, G. Chen, V. Bertacco, D. Sylvester, and D. Blaauw, “A highly
resilient routing algorithm for fault-tolerant NoCs,” in Proceedings of the Conference
on Design, Automation and Test in Europe, 2009, pp. 21–26.
[8] O. Villa, D. Johnson, M. O’Connor, E. Bolotin, D. Nellans, J. Luitjens, N. Sakharnykh,
P. Wang, P. Micikevicius, A. Scudiero, S. Keckler, and W. Dally, “Scaling the power
wall: A path to exascale,” in SC14: International Conference for High Performance
Computing, Networking, Storage and Analysis, Nov 2014, pp. 830–841.
[9] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, “The SPLASH-2 programs:
Characterization and methodological considerations,” in Proceedings of the 22nd Annual
International Symposium on Computer Architecture, 1995, pp. 24–36.
[10] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite:
Characterization and architectural implications,” in Proceedings of the 17th International
Conference on Parallel Architectures and Compilation Techniques, October 2008.
[11] J. L. Henning, “SPEC CPU suite growth: an historical perspective,” SIGARCH
Comput. Archit. News, vol. 35, pp. 65–68, March 2007.
[12] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon,
W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, P. Kogge, R. Lucas, M. Richards,
A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, “Exascale
computing study: Technology challenges in achieving exascale systems,” 2008.
[13] S. Borkar, “Design challenges of technology scaling,” IEEE Micro, vol. 19, no. 4, pp.
23–29, Jul-Aug 1999.
[14] “The opportunities and challenges of exascale computing,” 2010.
[Online]. Available: http://science.energy.gov/∼/media/ascr/ascac/pdf/reports/
Exascale subcommittee report.pdf
[15] C. Moore, “Data processing in exascale-class computer systems.” Presented at The
Salishan Conf. High Speed Computing, 2011.
[16] J. Shalf, S. Dosanjh, and J. Morrison, “Exascale computing technology challenges,”
in Proceedings of the 9th International Conference on High Performance Computing
for Computational Science, ser. VECPAR’10, 2011, pp. 1–25.
[17] D. E. Culler and J. P. Singh, Parallel Computer Architecture: A Hardware/Software
Approach. Morgan Kaufmann, 1999.
[18] W. Dally and B. Towles, Principles and Practices of Interconnection Networks.
Morgan Kaufmann, 2003.
[19] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The case for a
single-chip multiprocessor,” SIGPLAN Not., vol. 31, no. 9, p. 211, 1996.
[20] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano,
S. Smith, R. Stets, and B. Verghese, “Piranha: a scalable architecture based on single-
chip multiprocessing,” in Proceedings of the 27th Annual Symposium on Computer
Architecture (ISCA’00), Palo Alto, CA, 2000, pp. 282–293.
[21] J. Held, “Single-chip cloud computer: An experimental many-core processor from
Intel Labs.” Presented at Intel Labs Single-chip Cloud Computer Symposium, Santa
Clara, California, Feb. 12, 2010.
[22] M. Mattina, “Architecture and performance of the TILE-GX processor family,”
White Paper. [Online]. Available: http://www.tilera.com/sites/default/files/images/
content/White Paper Architecture and Performance TILE-Gx.pdf
[23] B. de Dinechin, R. Ayrignac, P.-E. Beaucamps, P. Couvert, B. Ganne, P. de Massas,
F. Jacquet, S. Jones, N. Chaisemartin, F. Riss, and T. Strudel, “A clustered
manycore processor architecture for embedded and accelerated applications,” in High
Performance Extreme Computing Conference (HPEC), 2013.
[24] NVIDIA’s next generation CUDA™ compute architecture: Fermi (2014,
April). [Online]. Available: www.nvidia.com/content/PDF/fermi white papers/
NVIDIA Fermi Compute Architecture\ Whitepaper.pdf
[25] J. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip networks,” in
Proc. 20th ACM Int. Conf. Supercomputing (ICS), Cairns, Australia, June 2006, pp.
187–198.
[26] J. Kim, W. J. Dally, and D. Abts, “Flattened butterfly: Cost-efficient topology for
high-radix networks,” in Proceedings of 34th Annual International Symposium on
Computer Architecture (ISCA), June 2007, pp. 126–137.
[27] L. Benini and G. D. Micheli, “Networks on chips: A new SoC paradigm,” IEEE
Computer, vol. 35, pp. 70–78, 2002.
[28] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-
chip,” ACM Comput. Surv., vol. 38, no. 1, Jun. 2006.
[29] W. Dally, “Virtual-channel flow control,” in Proc. Int. Symp. Computer Architecture,
1990.
[30] M. Galles, “Scalable pipelined interconnect for distributed endpoint routing: The SGI
SPIDER chip,” in Proc. Hot Interconnects Symp. IV, 1996, pp. 141–146.
[31] S. S. Mukherjee, P. Bannon, S. Lang, A. Spink, and D. Webb, “The Alpha 21364
network architecture,” IEEE Micro, vol. 22, no. 1, pp. 26–35, 2002.
[32] R. Mullins, A. West, and S. Moore, “Low-latency virtual-channel routers for on-chip
networks,” in Proc. Int. Symp. Computer Architecture, 2004, pp. 188–197.
[33] L.-S. Peh and W. J. Dally, “A delay model for router microarchitectures,” IEEE Micro,
vol. 21, no. 1, pp. 26–34, Jan. 2001.
[34] G. Michelogiannakis, D. Sanchez, W. Dally, and C. Kozyrakis, “Evaluating bufferless
flow control for on-chip networks,” in Fourth ACM/IEEE International Symposium on
Networks-on-Chip (NOCS), May 2010, pp. 9–16.
[35] P. Kundu, “On-die interconnects for next generation CMPs,” in 2006 Workshop on
On- and Off-Chip Interconnection Networks for Multicore Systems, Stanford, CA,
USA, December 6-7 2006.
[36] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, “A 5-GHz mesh
interconnect for a teraflops processor,” IEEE Micro, pp. 51–61, September/October
2007.
[37] A. K. Kodi, A. Sarathy, and A. Louri, “iDEAL: Inter-router dual-function energy-
and area-efficient links for network-on-chip (NoC),” in Proceedings of the 35th
International Symposium on Computer Architecture (ISCA’08), Beijing, China, June
2008, pp. 241–250.
[38] G. Michelogiannakis, J. Balfour, and W. J. Dally, “Elastic-buffer flow control for on-
chip networks,” in Proceedings of the Fifteenth International Symposium on High-
Performance Computer Architecture, 2009, pp. 151–162.
[39] D. DiTomaso, R. Morris, A. Kodi, A. Sarathy, and A. Louri, “Extending the energy-
efficiency and performance with channel buffers, crossbars and topology analysis for
NoCs,” IEEE Transactions on VLSI, vol. 21, no. 11, pp. 2141–2154, November 2013.
[40] D. DiTomaso, T. Boraten, A. Kodi, and A. Louri, “Evaluation of fault tolerant channel
buffers for improving reliability in NoCs,” in 55th International Midwest Symposium
on Circuits and Systems, August 2012.
[41] A. K. Kodi, R. Morris, D. DiTomaso, A. Sarathy, and A. Louri, “Co-design of channel
buffers and crossbar organizations in NoCs architectures,” in IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), 2011.
[42] T. Moscibroda and O. Mutlu, “A case for bufferless routing in on-chip networks,” in
Proceedings of the 36th annual International Symposium on Computer Architecture,
June 2009.
[43] M. Hayenga, N. E. Jerger, and M. Lipasti, “Scarab: A single cycle adaptive routing
and bufferless network,” in Proceedings of the 42nd Annual IEEE/ACM International
Symposium on Microarchitecture, December 2009.
[44] J. Kim, C. A. Nicopoulos, D. Park, N. Vijaykrishnan, M. S. Yousif, and C. R. Das,
“A gracefully degrading and energy-efficient modular router architecture for on-chip
networks,” in Proceedings of the 33rd Annual International Symposium on Computer
Architecture (ISCA), Boston, MA, USA, June 17-21 2006, pp. 4–15.
[45] G. Mora, J. Flich, J. Duato, P. Lopez, E. Baydal, and O. Lysne, “Towards an
efficient switch architecture for high-radix switches,” in ACM/IEEE Symposium on
Architecture for Networking and Communications systems, December 2006, pp. 11–20.
[46] Y.-C. Lan, S.-H. Lo, Y.-C. Lin, Y.-H. Hu, and S.-J. Chen, “BiNoC: A bidirectional
NoC architecture with dynamic self-reconfigurable channel,” in Proceedings of the 3rd
ACM/IEEE International Symposium on Networks-on-Chip, 2009, pp. 266–275.
[47] R. Hesse, J. Nicholls, and N. Jerger, “Fine-grained bandwidth adaptivity in networks-
on-chip using bidirectional channels,” in Sixth IEEE/ACM International Symposium
on Networks on Chip (NoCS), May 2012, pp. 132–141.
[48] M. Hayenga and M. Lipasti, “The NoX router,” in Proceedings of the 44th Annual
IEEE/ACM International Symposium on Microarchitecture, 2011, pp. 36–46.
[49] M. H. Cho, M. Lis, K. S. Shim, M. Kinsy, T. Wen, and S. Devadas, “Oblivious
routing in on-chip bandwidth-adaptive networks,” in Proceedings of the 2009 18th
International Conference on Parallel Architectures and Compilation Techniques,
2009, pp. 181–190.
[50] P. Kumar, Y. Pan, J. Kim, G. Memik, and A. Choudhary, “Exploring concentration and
channel slicing in on-chip network router,” in Proceedings of the 2009 3rd ACM/IEEE
International Symposium on Networks-on-Chip, pp. 276–285.
[51] S.-J. Chen, Y.-C. Lan, W.-C. Tsai, and Y.-H. Hu, Reconfigurable Networks-on-Chip.
Springer, 2012.
[52] JEDEC Solid State Technology Association, “Failure mechanisms and models for
semiconductor devices,” JEP122G, 2011.
[53] A. Prodromou, A. Panteli, C. Nicopoulos, and Y. Sazeides, “NoCAlert: An on-line
and real-time fault detection mechanism for network-on-chip architectures,” in The 45th
Annual IEEE/ACM International Symposium on Microarchitecture, 2012.
[54] V. Puente, J. A. Gregorio, F. Vallejo, and R. Beivide, “Immunet: A cheap and robust
fault-tolerant packet routing mechanism,” SIGARCH Comput. Archit. News, vol. 32,
no. 2, 2004.
[55] J. Kim, C. Nicopoulos, D. Park, N. Vijaykrishnan, and C. R. Das, “A gracefully
degrading and energy-efficient modular router architecture for on-chip networks,” in
Proceedings of the 33rd annual international symposium on Computer Architecture,
2006, pp. 4–15.
[56] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu, “Kilo-NOC: a heterogeneous
network-on-chip architecture for scalability and service guarantees,” in Proceedings
of the 38th annual international symposium on Computer architecture, 2011, pp. 401–
412.
[57] S. Lin, J. Shi, and H. Chen, “Designing cost-effective network-on-chip by dual-
channel access mechanism,” Journal of Systems Engineering and Electronics, vol. 22,
no. 4, pp. 557–564, Aug. 2011.
[58] E. Carara, F. Moraes, and N. Calazans, “Router architecture for high-performance
NoCs,” in Proceedings of the 20th annual conference on Integrated circuits and
systems design, 2007, pp. 111–116.
[59] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin,
and M. Orshansky, “BulletProof: a defect-tolerant CMP switch architecture,” in The
Twelfth International Symposium on High-Performance Computer Architecture, 2006,
pp. 5–16.
[60] W.-C. Tsai, D.-Y. Zheng, S.-J. Chen, and Y.-H. Hu, “A fault-tolerant NoC scheme using
bidirectional channel,” in 48th ACM/EDAC/IEEE Design Automation Conference
(DAC), 2011.
[61] R. Parikh, R. Das, and V. Bertacco, “Power-aware NoCs through routing and topology
reconfiguration,” in 51st ACM/EDAC/IEEE Design Automation Conference (DAC),
June 2014, pp. 1–6.
[62] R. Parikh and V. Bertacco, “Formally enhanced runtime verification to ensure NoC
functional correctness,” in Proceedings of the 44th Annual IEEE/ACM International
Symposium on Microarchitecture, 2011.
[63] A. Ansari, A. Mishra, J. Xu, and J. Torrellas, “Tangle: Route-oriented dynamic
voltage minimization for variation-afflicted, energy-efficient on-chip networks,” in
20th International Symposium on High Performance Computer Architecture, 2014.
[64] J. Xin and R. Joseph, “Identifying and predicting timing-critical instructions to boost
timing speculation,” in Proceedings of the 44th Annual IEEE/ACM International
Symposium on Microarchitecture, 2011.
[65] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, “Trace based phase prediction
for tightly-coupled heterogeneous cores,” in Proceedings of the 46th Annual
IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46, 2013,
pp. 445–456.
[66] Y. Wang, M. Martonosi, and L.-S. Peh, “A supervised learning approach for routing
optimizations in wireless sensor networks,” in Proceedings of the 2nd International
Workshop on Multi-hop Ad Hoc Networks: From Theory to Reality, 2006.
[67] S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. De Silva,
S. Rathnayake, X. Meng, and Y. Liu, “Detection of false sharing using machine
learning,” in Proceedings of the International Conference on High Performance
Computing, Networking, Storage and Analysis, 2013.
[68] J. Won, X. Chen, P. V. Gratz, J. Hu, and V. Soteriou, “Up by their bootstraps: Online
learning in artificial neural networks for CMP uncore power management,” in The
20th IEEE International Symposium on High Performance Computer Architecture
(HPCA), 2014.
[69] G. Steven, R. Anguera, C. Egan, F. Steven, and L. Vintan, “Dynamic branch
prediction using neural networks,” in Euromicro Symposium on Digital Systems
Design, 2001.
[70] D. Jimenez and C. Lin, “Dynamic branch prediction with perceptrons,” in The Seventh
International Symposium on High-Performance Computer Architecture, 2001.
[71] E. Kakoulli, V. Soteriou, and T. Theocharides, “Intelligent hotspot prediction for
network-on-chip-based multicore systems,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 31, no. 3, pp. 418–431, March 2012.
[72] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, pp. 81–106, 1986.
[73] L. Chen and T. M. Pinkston, “NoRD: Node-router decoupling for effective
power-gating of on-chip routers,” in Proceedings of the 45th Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO-45, 2012, pp. 270–281.
[74] J. H. Collet, A. Louri, V. T. Bhat, and P. Poluri, “Robust: a new self-healing
fault-tolerant NoC router,” in Proceedings of the 4th International Workshop on Network on
Chip Architectures, 2011, pp. 11–16.
[75] Y. Ho Song and T. M. Pinkston, “A progressive approach to handling message-
dependent deadlock in parallel computer systems,” IEEE Trans. Parallel Distrib.
Syst., vol. 14, no. 3, pp. 259–275, Mar 2003.
[76] S. Sarangi, B. Greskamp, R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas,
“VARIUS: A model of process variation and resulting timing errors for microarchitects,”
IEEE Transactions on Semiconductor Manufacturing, vol. 21, no. 1, pp. 3–13, Feb
2008.
[77] K. Aisopos, C.-H. Chen, and L.-S. Peh, “Enabling system-level modeling of variation-
induced faults in networks-on-chips,” in 48th ACM/EDAC/IEEE Design Automation
Conference (DAC), June 2011, pp. 930–935.
[78] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. Stan,
“HotSpot: a compact thermal modeling methodology for early-stage VLSI design,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 5,
pp. 501–513, May 2006.
[79] Y. Wang, S. Cotofana, and L. Fang, “A unified aging model of NBTI and HCI degradation
towards lifetime reliability management for nanoscale MOSFET circuits,” in IEEE/ACM
International Symposium on Nanoscale Architectures (NANOARCH), June 2011.
[80] T. Sakurai and A. Newton, “Alpha-power law MOSFET model and its applications to
CMOS inverter delay and other formulas,” IEEE Journal of Solid-State Circuits, vol. 25,
no. 2, pp. 584–594, Apr 1990.
[81] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos, “Modeling
within-die spatial correlation effects for process-design co-optimization,” in Sixth
International Symposium on Quality of Electronic Design, 2005.
[82] K. Brayer and J. J. L. Hammond, “Evaluation of error detection polynomial
performance on the autovon channel,” in National Telecommunications Conference,
December 1975, pp. 8–21.
[83] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg,
F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation platform,”
Computer, vol. 35, pp. 50–58, 2002.
[84] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore,
M. Hill, and D. Wood, “Multifacet’s general execution-driven multiprocessor
simulator (GEMS) toolset,” ACM SIGARCH Computer Architecture News, no. 4, pp.
92–99, November 2005.
[85] J. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip networks,” in
Proceedings of the 20th ACM International Conference on Supercomputing (ICS),
Cairns, Australia, June 28-30 2006, pp. 187–198.
[86] C. Sun, C.-H. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and
V. Stojanovic, “DSENT - a tool connecting emerging photonics with electronics
for opto-electronic networks-on-chip modeling,” in Sixth IEEE/ACM International
Symposium on Networks on Chip (NoCS), May 2012.
[87] D. Park, C. Nicopoulos, J. Kim, N. Vijaykrishnan, and C. Das, “Exploring fault-
tolerant network-on-chip architectures,” in International Conference on Dependable
Systems and Networks, 2006.
[88] L. Chen, L. Zhao, R. Wang, and T. M. Pinkston, “MP3: Minimizing performance
penalty for power-gating of Clos network-on-chip,” in The 20th IEEE International
Symposium on High Performance Computer Architecture, February 2014.