Tackling QoS-induced Aging in Exascale Systems Through
Agile Path Selection
Dean Michael Ancajas∗

Koushik Chakraborty∗

Sanghamitra Roy∗

Jason Allred†

∗ USU BRIDGE LAB, Electrical and Computer Engineering, Utah State University

dbancajas@gmail.com, {koushik.chakraborty, sanghamitra.roy}@usu.edu
† Hill Air Force Base, Ogden, Utah

jason.m.allred@aggiemail.usu.edu

ABSTRACT
Network-On-Chips (NoCs) have become the standard communication platform for future massively parallel systems
due to their performance, flexibility and scalability advantages. However, reliability issues brought about by scaling in
the sub-20nm era threaten to undermine the benefits offered
by NoCs. In this paper, we show that QoS policies exacerbate
the reliability profile of an exascale system. To mitigate this
imposing challenge, we propose Dynamic Wearout Resilient
Routing (DWRR) algorithms in QoS-enabled exascale NoCs.
Our proposal includes two novel DWRR algorithms enabled
by a critical-path monitor and a broadcast-based routing configuration. Using PARSEC benchmarks, our best algorithm
improves QoS and long-term sustainability (Mean Time To
Failure) of the system by an average of 16% and 25% compared to a state-of-the-art fault tolerant technique, respectively.

1. INTRODUCTION

The parallel computing landscape has undergone a rapid
transformation in the past decade. From having a few very
powerful multicores in a single chip, we are now seeing a
shift in design trends, where hundreds of simpler cores are
packed together. Some have envisioned the processor as the
new transistor, integrated thousands or millions of times on
a chip, much as transistors were when scaling began [1].
Exascale computers are an example of such heavily integrated
systems, capable of at least 1 exaflop of computing power.
Along with this trend, high-speed, high-bandwidth on-chip interconnect systems (Network-On-Chips or NoCs) have
also superseded unscalable bus interconnect architectures to
complement the enormous on-chip parallelism.
NoCs feature a high-performance, scalable, decentralized communication system that sends data as packets instead of wire signals. As with any large exascale computing system sharing a limited number of resources, a Quality-of-Service (QoS) policy needs to be enforced in order to ensure
fairness among different users/programs running in the system [2].
Several recent works have tackled the problem of providing a scalable
QoS in exascale computing systems [2][3]. However, many of these
proposals do not consider the reliability impact of providing QoS.

In this work, we show that supporting QoS in an exascale
NoC has profound implications on its reliability. Our experiments show that as an NoC is scaled, providing QoS dramatically reduces its Mean Time To Failure (MTTF) due to the
increased power consumption and elevated thermal profile.
These increased power/thermal characteristics from QoS support do not come from a performance increase, but rather from
a more proportionate resource management enabled
through QoS support [3]. Thus, although the NoC system
continues to offer an identical bandwidth, QoS support
results in accelerated wearout and a reduced lifetime.
To effectively increase the lifetime of QoS-enabled exascale
NoCs, this work introduces Dynamic Wearout Resilient Routing (DWRR) algorithms. Our DWRR algorithms balance QoS
and reliability impact by mitigating the additional stress induced by enforcing QoS guarantees in the system. Our proposed schemes are based on a cross-layer theme, where device-level wearout is sensed at the circuit layer and communicated to the architecture layer to dictate the routing path
selection. Overall, we increase the lifetime by aiming to retain uniform wearout among all routers in an exascale NoC.

We make the following contributions in this paper:

• Reliability Analysis of QoS and Scaling Effects in Exascale NoCs: We show an analysis of the reliability impact of providing QoS in large scale NoCs. Our experiments show that the MTTF of a 1024-node NoC can be
reduced by 36.6% (two years) of its rated lifetime just
by supporting QoS (Section 2). In the same light, as we
increase the number of nodes in the system from 64 to
1024, the MTTF can be reduced by as much as
48%. These looming problems signal a need for managing the resources in an exascale NoC appropriately.
• DWRR Algorithms: We develop two novel DWRR algorithms (Section 3.4). Our first algorithm, fresh routing (FR), lengthens NoC lifetime by always prioritizing
the least degraded path at the cost of reduced QoS. Our
second algorithm, latency-reclamation routing (LR), gives
balanced priority to aging and QoS. To the best of our
knowledge, this is the first work to tackle the tension
between reliability and QoS in an exascale NoC.
• Top-down Evaluation Platform: Our top-down simulation platform provides a rigorous assessment of system reliability. The infrastructure we created spans the
whole spectrum of design abstraction layers, from architectural simulation down to transistor-level analysis. In this process, we integrate several tools such as
Booksim 2.0, DSENT 0.91, HotSpot 5.02, an in-house
logic analysis tool and HSpice. These tools perform architectural simulation, power analysis, thermal analysis and delay degradation evaluation of major wearout
mechanisms. Compared to the recently proposed state-of-the-art VICIS [4], our best algorithm improves QoS
and long-term sustainability (Mean Time To Failure) of
the system by an average of 16% and 25%, respectively,
while avoiding an increase in flit latency.

Figure 1: Effect of Scaling on Average Router Power Consumption (16nm node). [Plot: power (uW) versus injection rate for 64-node, 256-node and 1024-node NoCs.]

2. MOTIVATION

In this section, we present a quantitative analysis of the
threat posed by rapid device level wearout on an exascale
NoC design. In particular, two key sustainability challenges
stemming from wearout on these systems are: (a) wearout
impact of NoC scalability (Section 2.1); and (b) adverse interaction between QoS support and NoC lifetime (Section 2.2).
Finally, we briefly outline the profound significance of our
analysis on exascale NoC designs in Section 2.3.

2.1 Scalability Impact on NoC Wearout
Figure 1 shows the effect of scaling on the average router
power. As we increase the number of nodes in the system,
more power is needed to support the enormous communication bandwidth/traffic demand imposed by more nodes.
For instance, at an injection rate of 0.1 flits/cycle, a router in
an exascale system (1024 nodes) will have to consume about
2.2× the power of a router in a 64-node system.
For low network loads, the power consumption difference is
barely 20%. Such a large power increase leads to higher power
and thermal densities, degrading the system reliability.
Figure 2 shows the detrimental lifetime impact of wearout
on exascale NoCs at 16nm, measured using the Mean Time
to Failure¹ (MTTF). The NoC is organized as a 2D mesh handling uniform-random traffic. As the NoC is scaled from
64 to 1024 nodes, the components in the central region experience rising stress from handling a larger number of traffic
flows, resulting in a significant MTTF reduction. For example, at an injection rate of 0.06 flits/cycle, the expected lifetime of
a kilo-node NoC is 48% less than that of a 64-node NoC. Furthermore, higher injection rates are increasingly likely on exascale NoCs running big data applications [7]. These applications with massive data footprints also show more concentrated traffic patterns (e.g., substantially more memory traffic
than inter-processor communication) [8]. Collectively, these
trends can drastically shorten the lifetime
of the NoC. For example, doubling the injection rate from
0.04 to 0.08 shortens the lifespan of the 1024-node system by
45%, compared to 16% in the 64-node system.

¹ To estimate the MTTF, we simulate the NoC's delay degradation at a month granularity under several major device wearout mechanisms [5, 6], until we reach a threshold
value (see Section 4).

Figure 2: Trend for Scaling impact on MTTF (16nm node). [Plot: MTTF (years) versus injection rate for 64-node, 256-node and 1024-node NoCs.]

2.2 QoS Support and NoC Lifetime
To uncover the conflict between wearout resilience and QoS
support, we use the GSF QoS policy [3] to run a benchmark
with heavy memory traffic on an NoC supporting an exascale system with distributed memory controllers. This benchmark broadly represents a big-data application, where the
process-memory communication typically dwarfs inter-processor
communication [8]. The question, then, is how QoS support alters the stress patterns seen in the NoC.
To understand this aspect, one must recognize that QoS support does not alter the underlying bandwidth or performance offered by the NoC [2]. Instead, QoS support changes
how resources are proportioned among the various flow
demands. For example, a fairness guarantee through QoS
support leads to a more uniform distribution of bandwidth
among the on-chip flows. This phenomenon, also observed
by [3], enables more flows to communicate simultaneously through the NoC, causing a dramatic increase in the
power and thermal profile of the system. To explain this concept, we show a simple example in Figure 3(a), where three
nodes A, B and E attempt to send flits to D. Without QoS,
nodes A and B can be unfairly treated (only receiving 1/4th
of the bandwidth due to contention). However, QoS support
fairly distributes the E − D link bandwidth between all three
nodes (A, B, and E), resulting in increased power consumption from the greater network activity. Note that the total
bandwidth available in the network does not change when
supporting QoS.
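The bandwidth shares in the Figure 3(a) example can be reproduced with a small calculation. The sketch below rests on an assumption that the text does not spell out: A and B are taken to merge upstream and arrive at router E on a single input port, E injects its own traffic on a second port, and each router without QoS splits its output link equally among input ports. The function names are hypothetical.

```python
from fractions import Fraction

def per_hop_round_robin(ports, share=Fraction(1)):
    """Split `share` equally among input ports at every router (no QoS).

    `ports` is a nested list: a string is a traffic source, a list is an
    upstream router whose inputs are merged onto one port.
    """
    shares = {}
    for port in ports:
        port_share = share / len(ports)
        if isinstance(port, str):
            shares[port] = port_share
        else:
            shares.update(per_hop_round_robin(port, port_share))
    return shares

def flow_fair(sources):
    """Split the link equally among flows, as a fairness guarantee would."""
    return {s: Fraction(1, len(sources)) for s in sources}

# Assumed layout: A and B share one input port of router E; E injects locally.
no_qos = per_hop_round_robin([["A", "B"], "E"])
with_qos = flow_fair(["A", "B", "E"])
print(no_qos)    # A and B: 1/4 each, E: 1/2
print(with_qos)  # every flow: 1/3
```

In both cases the shares sum to one, which matches the observation that QoS redistributes the E − D link bandwidth rather than adding to it.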
Such increased power and thermal profile from QoS support causes accelerated wearout in the NoC devices. Figures
3(b) and 4 show the impact of QoS support on the MTTF and
power of the NoCs, respectively. At larger network sizes, the difference
in power consumption between supporting QoS and not is
even more pronounced as there are more traffic flows that
need to be supported. We find that the maximum numbers of
flows that have to be guaranteed at a certain level of QoS are
2016, 32,640 and 523,776 for 64, 256 and 1024 nodes, respectively;
these counts equal n(n − 1)/2, the number of distinct node pairs among n nodes.
Consequently, we notice that QoS support has a pronounced
detrimental impact on the lifetime of an exascale system. We
observe that adding QoS support at 1024 nodes decreases the
MTTF by almost two years or 36% of its rated lifetime.

Figure 3: Conflicting Goals of QoS Support and Sustainability: Although the bandwidth offered by the NoC remains unchanged,
different resource usage under QoS causes an accelerated wearout and a shortened lifetime. (a) QoS effect on the Network Traffic. (b) MTTF Impact of QoS Support. [Panel (b) plots MTTF (years) versus network size (64, 256, 1024) for With QoS, No QoS and Difference.]

Figure 4: Effect of Providing QoS on the Average Router Power
Consumption. [Plot: power (x10 uW) versus network size (64, 256, 1024) for With QoS, No QoS and Difference.]

Authorized licensed use limited to: Utah State University. Downloaded on November 11,2022 at 18:59:23 UTC from IEEE Xplore. Restrictions apply.

2.3 Significance
Without a reliability-aware policy in place, the QoS policy
dramatically decreases the average MTTF of an exascale system. However, for large systems, QoS is indispensable to
avoid the problem shown in Figure 3(a). Then, a key question is how can we enable QoS support while limiting its
damaging impact on the NoC lifetime? Thus, there is a need
to implement reliability-aware schemes to improve MTTF,
while still maintaining guarantees enforced by QoS policies.
In this work, we propose a reliability-aware routing algorithm to balance the lifetime of NoC routers by using network-wide aging information to route packets.

3. DESIGN OF WEAROUT RESILIENT ROUTING IN AN EXASCALE NOC

In this section, we describe the design of our proposed
QoS-aware wearout resilient routing techniques for exascale
NoCs. Our proposed three-step approach spans multiple layers to manage wearout degradation in NoCs, while maintaining short-term power-performance goals: 1) we sense
the device-level wearout of routers and links in the circuit
layer using our NoC Health Meter (Section 3.1); 2) we communicate between the circuit and the architecture layers by
propagating the wearout information across the NoC (Section 3.2); and 3) we apply wearout information in the architecture layer during NoC routing to dynamically mitigate the
effects of aging (Section 3.4).

3.1 NoC Health Meter (NHM)

To guide our routing algorithm, the NHM profiles the extent of degradation in each router and its incoming links. The
degraded delay of the incoming links is measured as part of the
first stage of the router (i.e., the input buffers). The NHM circuit
shown in Figure 6 augments all the pipe stages of a router.
Our proposed meter essentially measures the delay degradation in the combinational circuit between two pipeline registers by measuring the slack in each stage. To this end, we use
a high-resolution, all-digital, self-calibrating time-to-digital
converter (HR-TDC) consisting of a Vernier Chain (VChain)
circuit that has a measurement resolution of 5ps [9]. After
measuring the delay degradation of each stage, we estimate
Dmax: the maximum degradation among all pipe stages.
The HR-TDC is an in situ delay-slack monitor with an overall measurement window
of 150ps, which is sufficient for timing slack measurements
in 2GHz+ systems. To accommodate process variability, the
HR-TDC is calibrated post-fabrication so that each HR-TDC
stage is tuned to 5ps increments. To avoid the expensive
cost of post-silicon calibration (i.e., off-chip measurements
and testing), the HR-TDC allows automatic self-calibration
under the control of firmware, using only an off-chip
crystal oscillator for clock generation. Fick et al. have demonstrated that a full self-calibration of an entire TDC
implemented on a 64-bit Alpha processor can take only five
minutes [9].
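A minimal sketch of how Dmax could be derived from per-stage slack readings is given below. It assumes that the degradation of a stage is the slack lost relative to a fresh (post-fabrication) reading, which the text implies but does not state as a formula. The stage names and picosecond values are hypothetical.

```python
def stage_degradation(fresh_slack_ps, measured_slack_ps):
    """Slack lost since the fresh measurement, floored at zero (assumed definition)."""
    return max(0, fresh_slack_ps - measured_slack_ps)

def router_dmax(fresh_slacks_ps, measured_slacks_ps):
    """Dmax: the maximum degradation among all pipe stages of one router."""
    return max(
        stage_degradation(fresh_slacks_ps[stage], measured_slacks_ps[stage])
        for stage in fresh_slacks_ps
    )

# Hypothetical slack readings in picoseconds for one router.
fresh = {"input_buffer": 120, "route_calc": 90, "vc_alloc": 75, "switch_traversal": 60}
aged = {"input_buffer": 110, "route_calc": 70, "vc_alloc": 70, "switch_traversal": 55}

dmax = router_dmax(fresh, aged)
print(dmax)  # 20, set by the route_calc stage in this made-up example
```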


Figure 5: High Resolution In-Situ Delay Slack Measurement from Fick et al. [9].

3.1.1 Using HR-TDC in NoCs
In this section, we discuss in more detail the use of High
Resolution Time-to-Digital Converters in NoCs.
Using HR-TDC circuits to measure the slack or propagation delay of each pipeline stage in an NoC is important because exascale chips with thousands of nodes can experience
both global and local Process-Voltage-Temperature (PVT) variability. Figure 5 shows an in situ delay-slack self-calibrating
Time-to-Digital converter from [9]. The HR-TDC operates in
three modes:
1. Normal operation - The HR-TDC measures the delay fed from the NoC Data Path. We do not measure all
paths, as that is expensive; instead, we measure only 30%
of the top most critical paths, as done by [10]. Subsequently, the data from the Time-to-Digital converter is
sent to the NHM and aggregated to determine the minimum slack (max delay) among all pipeline stages.
2. Reference Delay Chain (RDC) Calibration - The HR-TDC measures the delay of the "Reference Delay
Chain" using statistical sampling. Calibration of the
RDC is needed before VChain calibration is started.
3. Vernier Chain Calibration - The HR-TDC calibrates
the Vernier Chain so that each stage in the chain has a
delay of 5ps. A stage in the VChain is made tunable by
using eight firmware-controlled capacitor loads, with
each load designed to introduce a 1ps shift in the delay.
The Vernier Chain (i.e., the red portion of Figure 5) is responsible for measuring the slacks from the NoC data paths in each
pipeline stage and converting them to a digital code. The rest
of the modules in the figure (i.e., the green modules) are used
for support or for self-calibrating the delays in the circuit.
Immediately after fabrication, the HR-TDC is run in the calibration modes
(modes 2 and 3). At runtime, the HR-TDC is turned off. It is only
turned on during the boot-time sequence to measure the slacks
of the pipeline stages. If it is suspected that the delay chains
in the HR-TDC have changed, it can also be re-calibrated (run
in modes 2 and 3) in order to maintain high accuracy in the
slack measurements.
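The numbers quoted for the HR-TDC (5ps resolution, 150ps window, eight 1ps capacitor loads per Vernier stage) can be tied together in a small behavioural model. This is only an illustrative sketch and not the circuit from [9]: it assumes the digital code is the number of whole 5ps steps in the measured slack, and that each enabled load simply adds 1ps to a stage. The function names are hypothetical.

```python
RESOLUTION_PS = 5      # per-stage resolution after calibration
WINDOW_PS = 150        # overall measurement window
LOAD_STEP_PS = 1       # delay shift per capacitor load
MAX_LOADS = 8          # firmware-controlled loads per Vernier stage

def tdc_code(slack_ps):
    """Quantise a slack into a digital code, saturating at the window edges."""
    clamped = min(max(slack_ps, 0), WINDOW_PS)
    return int(clamped // RESOLUTION_PS)

def loads_for_stage(native_delay_ps, target_ps=RESOLUTION_PS):
    """Number of loads to enable so a stage lands near the target delay (assumed model)."""
    needed = round((target_ps - native_delay_ps) / LOAD_STEP_PS)
    return min(max(needed, 0), MAX_LOADS)

print(WINDOW_PS // RESOLUTION_PS)  # 30 steps span the window
print(tdc_code(47))                # 9
print(tdc_code(200))               # 30 (saturated)
print(loads_for_stage(2.2))        # 3
```

Under these assumptions, a 150ps window at 5ps per step corresponds to 30 distinguishable slack levels.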

3.1.2 Hardware Overhead
The NHM is a low-overhead circuit for measuring router
delays. We assume a router is faulty when its delay exceeds
the manufactured clock period by more than 20%. Overall,
our implementation of the NHM on top of the open-source
NoC router [11] at the 16nm technology node yields 3.2% and
1.2% overheads in area and power, respectively.

Figure 6: NoC Router Augmented with NHM.

3.2 Propagating Delay Information and Routing Table Update
We estimate and propagate the encoded delay information through the firmware during the system boot-up, once
a month. We use three steps to perform this function. First,
all nodes estimate their own Dmax in parallel throughout the
system (Section 3.1). Second, we broadcast the Dmax through
the flit link network. However, to avoid extreme flooding,
we divide the network into small equally sized regions. Then,
one node from each region broadcasts its Dmax throughout
the system. Third, the routing tables in each node are updated using this Dmax information. For a 1024-node system,
we find that Dmax propagation for all nodes can be performed
in under 0.5ms, assuming a 2GHz clock. Doing this process at
boot time avoids any runtime overhead, while negligibly affecting
system boot time.
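To put the 0.5ms figure in perspective, the sketch below converts it into clock cycles and shows one way the network could be divided into equally sized regions. The 16 × 16 router mesh follows the 256-router configuration described in Section 4.1, but the region size of 4 × 4 and the choice of the top-left router as each region's broadcaster are assumptions made purely for illustration.

```python
def cycle_budget(seconds, clock_hz):
    """Number of clock cycles available in the given time."""
    return round(seconds * clock_hz)

def region_representatives(width, height, region_side):
    """One router ID per square region (here, the region's top-left router)."""
    reps = []
    for row in range(0, height, region_side):
        for col in range(0, width, region_side):
            reps.append(row * width + col)
    return reps

budget = cycle_budget(0.5e-3, 2e9)
reps = region_representatives(width=16, height=16, region_side=4)
print(budget)      # 1000000 cycles in 0.5ms at 2GHz
print(len(reps))   # 16 regions, hence 16 broadcasters
```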

3.3 Routing Algorithm
In this section, we discuss the light-weight and scalable
routing algorithm that we use to route packets in order to
balance path degradation across the NoCs.

3.3.1 Scalable Routing


Figure 7: Two-Turn Path Routing. [Diagram: mesh of nodes 0 to 11, with src at node 0 and dest at node 11.]

Figure 8: Head Flit: Unused Fields can be used to store additional algorithmic routing information. [Fields shown: Destination, Turning Point #1, Turning Point #2, direction bit.]

The routing algorithm that we use profiles all two-turn
minimal paths of all source-destination pairs. The paths are
chosen based on a particular metric such as average router
degradation or maximum router degradation. Note that the
path for a particular source-destination pair is only updated
once per month at boot-up, after all wearout information is propagated throughout the system. The firmware
selects which path to take by writing the node addresses of
the turning points in a routing register. Subsequently, the
NoC reads this register at runtime and encodes the routing
information in the head flit. For scalability purposes, our approach uses algorithmic routing to decide which port the flit
should be sent to.
Figure 7 shows an example of our routing algorithm in action. In this example, it is assumed that the firmware has
already decided which turns to make for a flit with source 0 and destination 11. The turns are made at
nodes 2 and 10. Additionally, a single bit in the head flit is
used to indicate which direction the flit should first go, X
or Y. Whether it is up/down or left/right is
decided by the algorithmic routing based on the relative addresses of the source and the turning points. Once the flit reaches
one of the turning nodes, it turns towards the destination. As such, our algorithm is very scalable: no matter what the size of the exascale NoC is,
the routing information stored in a flit (i.e., the addresses of the turning points) grows only
as log(n), with n being the number of nodes. We show all this
information in Figure 8. The total overhead is minimal at
2 × log(n) + 1 bits.
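The Figure 7 example and the header overhead can be checked with a short sketch. It assumes row-major node numbering on a mesh four nodes wide (consistent with turns at nodes 2 and 10 on the way from 0 to 11) and a base-2 logarithm rounded up to whole bits. Neither is stated explicitly in the text, and the helper names are hypothetical.

```python
import math

def header_bits(num_nodes):
    """Two turning-point addresses plus one direction bit."""
    return 2 * math.ceil(math.log2(num_nodes)) + 1

def straight(a, b, width):
    """Nodes visited when moving from a to b along one row or column (a excluded)."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    if ay == by:                               # move in X
        step = 1 if bx > ax else -1
        return [ay * width + x for x in range(ax + step, bx + step, step)]
    if ax == bx:                               # move in Y
        step = width if by > ay else -width
        return list(range(a + step, b + step, step))
    raise ValueError("segment is not axis-aligned")

def two_turn_path(src, turn1, turn2, dest, width):
    """Expand a (src, turn1, turn2, dest) tuple into the full hop sequence."""
    path = [src]
    for a, b in ((src, turn1), (turn1, turn2), (turn2, dest)):
        path += straight(a, b, width)
    return path

print(two_turn_path(0, 2, 10, 11, width=4))          # [0, 1, 2, 6, 10, 11]
print([header_bits(n) for n in (64, 256, 1024)])     # [13, 17, 21]
```

Under these assumptions, growing the network from 64 to 1024 nodes adds only eight bits to the head flit.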
As discussed above, we only choose between two-turn
paths when deciding which routes to use. The reason
for using this radically smaller solution space is that using
three or more turns does not provide any significant benefit
but incurs linear overhead in flit space requirements (i.e.,
an m-turn path needs m × log(n) bits in the flit to route). In
fact, our simulations show that using an unlimited number
of turns yields less than 4% degradation at the most, compared to two-turn paths.

3.3.2 Deadlock Avoidance
Routing packets using various two-turn path configurations can lead to protocol deadlock when cyclic resource dependencies exist. In our DWRR implementation, we allocate one Virtual Channel (VC) in each port as an escape channel,
to be used only when avoiding a deadlock. Normally, when
there is no contention, the flits are routed on the non-escape
channels. However, when all non-escape VCs from all
routers are occupied for a certain period of time (100 cycles
in our simulations), a cyclic dependency could exist. This is
possible because we do not restrict flits to use the same VC
ID in each hop, in order to maximize the bandwidth of the
network. We break this cyclic dependency by halting further
injection in the NoC and allowing in-flight flits to arrive at
their destination using deterministic routing via the escape
channels.
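The timeout condition described above can be sketched as a simple watchdog counter. This is an illustrative model rather than the authors' hardware: the class name and the per-cycle `tick` interface are hypothetical, and only the 100-cycle threshold comes from the text.

```python
class DeadlockWatchdog:
    """Flags a possible cyclic dependency after a sustained full-occupancy period."""

    def __init__(self, timeout_cycles=100):
        self.timeout_cycles = timeout_cycles
        self.stalled_cycles = 0

    def tick(self, all_non_escape_vcs_busy):
        """Call once per cycle; returns True when injection should be halted
        and in-flight flits drained through the escape channels."""
        if all_non_escape_vcs_busy:
            self.stalled_cycles += 1
        else:
            self.stalled_cycles = 0     # any free non-escape VC resets the count
        return self.stalled_cycles >= self.timeout_cycles

watchdog = DeadlockWatchdog()
fired = [watchdog.tick(True) for _ in range(100)]
print(fired.index(True))  # 99, i.e. the 100th consecutive busy cycle
```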

3.4 Applying NoC Health Meter in Dynamic
Wearout Resilient Routing
We propose two DWRR algorithms that harness the platform provided by the NoC health meter to dampen the additional QoS-induced traffic stress in NoC routers. We use
Duato's theory to restrict virtual channels to specific packet
classes to avoid deadlocks [12]. Our first algorithm is Fresh
Routing (FR), which always routes the flits using the least-degraded path. This path is constructed by considering several minimal paths and comparing the average wearout information in each path. Our second algorithm, Latency Reclamation routing (LR), seeks to balance congestion and reliability objectives by using dynamic runtime information when
deciding a path. LR first compares the number of available
credits (a metric quantifying the level of congestion in a node)
of neighboring routers. If the least-degraded path is congested, LR chooses the non-congested path. Let us consider a routing path with p routers, having maximum delays
D_1, D_2, ..., D_p, respectively. We study two variants of both FR
and LR using runtime information:

• FR Avg - This scheme uses the average wearout of all
routers in a path to select the least-aged path
(D_path = avg(D_1, D_2, ..., D_p)).
• FR Max - This variant of the FR algorithm selects the
least-aged path using the maximum router wearout of
each path (D_path = max(D_1, D_2, ..., D_p)). This scheme
seeks to limit the wearout of the most degraded router
at any time interval.
• LR Avg - This scheme is similar to FR Avg, selecting the
least-aged path based on average. However, during
congestion, it avoids queuing delay by sending flits in
the direction with more credits when the least-aged
path is overly congested. We discuss more details
about congestion-awareness in Section 3.4.1.
• LR Max - This variant of the LR algorithm also allows
credit-based exceptions to the least-aged path. However, like the FR Max scheme, it determines the least-aged
path using the maximum router delay in each path.

Figure 9: Different Path Objectives of DWRR vs. CAR. DWRR
uses p1 while CAR uses p2. [Diagram: two paths from src to dest; p1 is not degraded but congested, p2 is degraded but not congested.]

3.4.1 Congestion Awareness
Our DWRR algorithms have congestion-aware variants,
LR Avg and LR Max. We did not use a complicated congestion avoidance scheme because the overhead of providing
network-wide congestion data to each node in an exascale
system is prohibitive. For instance, in an n × n mesh
network, adding a point-to-point congestion network would
need 2 × n × (n − 1) × m wires for an m-bit congestion resolution. Instead, we use the RC-1D [13] congestion-aware
algorithm to supplement LR Avg and LR Max. In this scheme,
the available bandwidth credit in each router is propagated
in a single dimension in the mesh. DWRR thus dynamically
estimates the level of congestion in various paths by comparing their respective bandwidth credits. When the reliability-aware path is congested (i.e., low credit), we can choose a less
congested path, temporarily overriding the reliability awareness.
We acknowledge that there are other possible congestion-aware
algorithms, such as piggybacking schemes (GCA [14]),
that do not add significant area overhead but are prone to estimation errors due to fast-changing congestion information
in the network. These estimation errors would only be exacerbated in an exascale setting because congestion profiles
will take longer to arrive.

3.4.2 Reliability vs Congestion Awareness
DWRR and congestion-aware routing (CAR) are both adaptive algorithms, but are designed to satisfy two distinct objectives. Consequently, routing paths for a given flow may
differ widely depending upon the conditions currently prevailing in the NoC. For example, DWRR may voluntarily
choose a congested path so as to reduce component aging
(p1 in Figure 9). These algorithms also differ in their information tracking granularity. While CAR requires frequent
(every few clock cycles) updates of congestion information, DWRR
uses a very coarse-grained aging information flow due to the
slow progress of device wearout (months).

4. METHODOLOGY

In this section, we discuss the simulation infrastructure
that we use to evaluate aging of an exascale NoC. Doing
a lifetime simulation (5-10 years) to assess aging impact is
computationally prohibitive. Instead, we employ a methodology to accurately capture the long-term aging impact in
NoCs, while keeping a tractable simulation time. We first
discuss our NoC model with respect to how we evaluate its
lifetime (Section 4.1). Then, we explain our cross-layer simulation framework that uses several tools ranging from architectural down to circuit-level simulators, in order to accurately assess long-term degradation in NoCs (Section 4.2).

4.1 Exascale NoC Reliability Evaluation
Our evaluation of network-wide reliability encompasses
several layers in the hierarchy. We start at the level of an NoC
running a real-world application and traverse different design levels (modules, gates etc.) down to the transistor level.
Figure 10a shows this hierarchy along with supplemental information. We briefly describe the role that each level plays:

• Network-On-Chip: We model an exascale NoC at the
16nm technology node [15]. The NoC is composed of
1024 nodes, connected by 256 high-speed virtual channel routers implementing the GSF [3] QoS policy, and
running real-world traffic benchmarks. As NoC system utilization plays a large role in aging [16], all traffic
patterns are accurately captured using an architectural
simulator. All simulation statistics are then propagated
down the hierarchy for further evaluation.
• Router: The router has two equivalent models: one for
the architectural simulation and a circuit-level model
for per-module power evaluation. The architectural model
is embedded in Booksim [12] and emulates a QoS-aware
router with 6 virtual channels, 8 input and 8 output ports,
5 buffers per virtual channel, and a 3-stage speculative pipeline (Route Calculation, Virtual Channel Allocation, Switch Traversal). For power evaluation, we
use DSENT [17], which models the router as a collection
of modules (e.g., crossbar, buffers, allocators). We configure DSENT to use the same parameters as
our architectural model. The power information is
used for Bias Temperature Instability (BTI) analysis.
• Modules: The router modules are modeled as a collection of gates. While many wearout mechanisms are
power-thermal driven, some also depend on switching
activities. For example, we model Hot-Carrier Injection (HCI) aging in the crossbar circuit, as it is the module that dictates cycle time [18,19]. To precisely model HCI, we collect actual bit patterns during architectural simulation
to accurately capture the switching activity of different
modules, which dictates the Vth shift due to HCI [20].
• Gates: Gates are modeled as a collection of transistors connected together. Not all switching transitions
cause HCI degradation; therefore, we only analyze the
pertinent transitions of the most susceptible circuits in
the NoC to reduce simulation time [21]. Vth degradation due to BTI and TDDB (described below) is captured at the gate level by estimating the delay degradation in each individual gate through HSpice simulation. Subsequently, we use static timing analysis to
evaluate the timing characteristics of various modules
under wearout by carefully aggregating the aging-induced gate delays.
• Transistor: At the transistor level, we model aging degradation as a combination of both BTI and HCI effects.
BTI, Time-Dependent Dielectric Breakdown (TDDB) and
Electromigration (EM) degradation are accelerated by
high temperatures [22], while HCI is manifested from
excessive switching activity. We use the information
gathered from the higher levels in the hierarchy and
calculate threshold voltage (Vth) degradation using the
aging models from [5, 6]. Both these mechanisms cause
accelerated Vth degradation, decreasing timing slack,
and eventually causing failure.

Figure 10: Reliability Aware Framework.

4.2 Lifetime Simulation
Detailed lifetime reliability simulation of an exascale NoC
is a computationally intensive task because aging-induced
degradation takes years before manifesting as timing errors.
In our simulations, we develop a technique to accurately approximate the lifetime aging impact of running different benchmark programs on an exascale NoC.
Our simulation framework is composed of several tools.
We use Booksim 2.0 [12] and Netrace 0.9 [23] for accurate
traffic modeling without incurring the overhead of a full-system architectural simulation. Power and temperature modeling of different NoC modules is done using DSENT 0.91
[17] and HotSpot 5.02 [24]. We also implement an in-house
switching activity analyzer to observe HCI stress of transistors. Lastly, we use Synopsys HSpice to evaluate the timing degradation caused by BTI- and HCI-induced aging.
We build our reliability framework as a closed-loop system
with each iteration equivalent to one (1) month of wall clock
time. As such, a simulation of N years would take N×12
iterations. We explain each step of the iteration:
1. We run real application benchmarks (PARSEC) using
the Gems/Garnet setup to gather accurate traffic patterns and NoC router utilization profiles.
2. After the architectural simulation, we use the data dump
and per-router utilization to evaluate power statistics
using DSENT.
3. DSENT’s power outputs are then used as input to HotSpot
5.02, along with a minimalistic floorplan (also from DSENT),
to obtain the steady state temperature profile of each
router. Meanwhile, the switching activity analyzer outputs the HCI profile of the representative circuit we
evaluate [11]. Note that we only use a small circuit
(crossbar) as the switching activity analysis is a computationally intensive task.
4. Our aging models [5, 6] take as input the current time,
the switching and the temperature profile to calculate
the aging induced Vth degradation. The change in Vth
is then annotated in the HSpice simulation and is used
to simulate the new slack of each router.
5. The new slack values are then annotated back to the
Booksim simulator and used by our routing algorithms
to calculate the new paths with minimal aging impact.

The time variable for aging analysis is also incremented.
This process is repeated until the end of lifetime of the
chip (i.e. when a slack threshold is reached).
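
The control flow of this closed loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Router`, `toy_slack_loss`, `simulate_lifetime` and `equalize_load` are hypothetical names, and the power-law slack model with its constants is an invented placeholder for the real tool chain (traffic simulation, DSENT, HotSpot, the aging models and HSpice), not the models used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MONTHS_PER_YEAR = 12  # one loop iteration models one month of wall clock time


@dataclass
class Router:
    slack: float        # remaining timing slack (arbitrary units; assumed)
    utilization: float  # stand-in for the traffic, power and temperature profile


def toy_slack_loss(utilization: float, month: int) -> float:
    """Cumulative slack lost after `month` months at a fixed utilization.

    Hypothetical power-law placeholder for steps 2-4; the constants are
    invented for illustration and are not taken from the paper.
    """
    return 0.01 * utilization * month ** 0.5


def simulate_lifetime(
    routers: List[Router],
    slack_threshold: float,
    max_years: int,
    reroute: Optional[Callable[[List[Router]], None]] = None,
) -> int:
    """Run the monthly closed loop and return the number of iterations executed."""
    max_months = max_years * MONTHS_PER_YEAR
    for month in range(1, max_months + 1):
        for r in routers:
            # Steps 1-4: age each router by this month's increment of slack loss.
            r.slack -= toy_slack_loss(r.utilization, month) - toy_slack_loss(
                r.utilization, month - 1
            )
        if min(r.slack for r in routers) <= slack_threshold:
            return month  # end of lifetime: the slack threshold is reached
        if reroute is not None:
            reroute(routers)  # step 5: feed the new slack values back to routing
    return max_months


def equalize_load(routers: List[Router]) -> None:
    """Toy rerouting policy that spreads utilization evenly (not FR or LR)."""
    mean = sum(r.utilization for r in routers) / len(routers)
    for r in routers:
        r.utilization = mean


# Usage: compare the toy lifetime with and without the rerouting hook.
no_reroute = simulate_lifetime([Router(1.0, 1.0), Router(1.0, 0.2)], 0.905, 30)
with_reroute = simulate_lifetime(
    [Router(1.0, 1.0), Router(1.0, 0.2)], 0.905, 30, reroute=equalize_load
)
print(no_reroute, with_reroute)
```

In this toy model, spreading the load delays the point at which the most stressed router reaches the slack threshold, which is the effect the routing step of the loop is meant to capture.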

5. EXPERIMENTAL RESULTS
In this section, we evaluate the effectiveness of our proposed DWRR algorithms. Our baseline scheme, labeled Base,
is an exascale system with a QoS policy similar to GSF [3]. We
evaluate four schemes: FR Avg, FR Min, LR Avg and LR Min. The
details of all these schemes are discussed in Section 3.4. We
compare the effectiveness of these schemes with the following metrics: the overall system MTTF; the quality-of-service,
measured through the mean-absolute difference (MAD) of packet latency and
the coefficient of variation (CoV); and lastly, performance.
We compare our schemes against a state-of-the-art related
work in fault tolerance called VICIS. VICIS is an NoC recovery mechanism proposed by Fick et al. [4]. VICIS provides
many enhancements in an NoC router. The most notable is
a crossbar bypass bus that is used to transfer flits from the
input port to the output port, once a wearout-induced hard
fault is manifested in the crossbar switch. Essentially, the
router is reconfigured to a bare bone version trading off performance for correct functionality.

5.1 MTTF: Fault Model and Comparison
MTTF is traditionally defined as the time it takes from the
start to the point of the first component failure. In NoCs,
this is usually the point where a router in the network fails.
However, proposals such as VICIS tolerate failures by providing hot-swappable bare bone components, thus retaining functional correctness at the cost of performance. As
more components fail, the performance penalty rises till the
system loses its ability to offer favorable cost-performance
benefit over replacing the system with a new one. To capture this interplay while having a fair comparison of VICIS
against our schemes (see footnote 2), in this work, we consider a system to
be faulty only when 5% of the routers in the network have
failed. Moreover, we chose a 20% delay guardband for each
component. Since VICIS offers fault tolerant routing, it may
appear that this fault model is not fully justified. However,
in reality, our experiments suggest that a more relaxed component failure threshold or an aggressive guardband further
degrades VICIS performance compared to DWRR. The reasons for this result are twofold:

• If we allow more component failures before considering a system to be faulty, the estimated MTTF improves
for both VICIS and DWRR, but the relative improvement in DWRR is even greater. On the other hand, the
performance of the system suffers significantly beyond
5% failures. We can notice the substantial rise in latency
soon after a component failure in VICIS. Moreover, as
VICIS replaces faulty buffered routers with logically
bufferless components, packets tend to incur more and
more non-deterministic delays, further hurting the QoS
support of the underlying network. Figure 11 shows
this, where after the 3rd year, VICIS’ latency has worsened compared to FRavg.
• If we allow an aggressive guardband in VICIS, we can
marginally increase the operating frequency of the network. However, this slightly higher frequency does not
provide any long-term performance boost. First, allowing a smaller guardband leads to faster component failures (e.g., much before the 3-year mark shown in Figure 11). Subsequently, VICIS starts to show poor performance compared to the relaxed guardband, due to
the increased latency from the bufferless routers. Second, the MTTF is also reduced as the point of 5% failures is reached earlier. Overall, aggressive guardbanding helps neither performance nor QoS, nor does it help
the MTTF.

Figure 11 shows the network latency in blackscholes (one of
the PARSEC benchmarks), as the system progressively degrades. Point A is the time of the first router failure; the
Base scheme stops at year 3. VICIS extends the lifetime of the
NoC by replacing broken routers with bare bone versions at
the cost of lower performance. Meanwhile, our schemes only
use DWRR to extend the lifetime of the routers. Eventually,
VICIS fails around 8 years (point B). FRavg improves the performance by further pushing the point of the first router failure to C (from A), thereafter complemented with VICIS to
promote both wearout resilience and fault-tolerance. Eventually, FR fails at 9.8 years (point D). The occasional improvement in latency is due to specific routing paths exercised in
that time frame (e.g., paths with fewer degraded routers),
although device level wearout monotonically degrades the
router delay characteristics.

[Figure 11: NoC Performance of blackscholes under progressive aging in various schemes. Latency versus year (0 to 10) for Base, VICIS and FRavg, with points A, B, C and D marked.]

Figure 12 shows the overall MTTF of the different schemes.
Base has the least MTTF with an average of 5.6 years. VICIS
extends the MTTF by 45% to 8.1 years. Our routing schemes
provide an additional 25% MTTF improvement on top of VICIS, on average. Among our proposed schemes, the minimum improvement over VICIS is 15%, while the maximum is 26%. FR does not consistently outperform LR in
terms of MTTF because the former only cares about routing
all packets through the least degraded path. This approach
can be occasionally detrimental as a particular path may be
overly used by all flits once it is determined to be optimal for
a certain time.

[Figure 12: MTTF (Higher is better). MTTF in years per benchmark (blackscholes, bodytrack, canneal, ferret, fluidanimate, swaptions, vips, x264), plus the average, for Base, VICIS, FRavg, FRmax, LRavg and LRmax.]

Footnote 2: Recall that our proposed schemes have an orthogonal goal
of extending fault-free and high-performance communication.

Authorized licensed use limited to: Utah State University. Downloaded on November 11,2022 at 18:59:23 UTC from IEEE Xplore. Restrictions apply.

5.2 Quality-of-Service
We show Quality-of-Service (QoS) using two widely used
metrics in Figures 13 and 14. Figure 13 shows the jitter of the
programs under different comparative schemes, where jitter is measured using the Mean-Absolute-Difference (MAD)
[25]. For a majority of the programs, our best schemes provide better resilience to jitter, compared to VICIS. Our schemes
achieve this by dampening the traffic-induced stress incurred
by adding QoS. The DWRR algorithms redistribute the traffic around degraded routers, hence, extending the lifetime of
high performance routers. Consequently, our schemes maintain the QoS in the system, while VICIS’s QoS degrades due
to many low-performance bare bone routers replacing the
original routers. LRmax provides the best improvement on
swaptions, which sees 30% more resilience to jitter. On average, the improvement is about 9.3% for LRmax. Some benchmarks show little or no benefit in terms of jitter resilience,
primarily due to low levels of congestion in them, offering
little opportunity to improve jitter.
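
As a small illustration of the jitter metric, the following Python sketch computes MAD over a list of packet latencies. It assumes MAD is the mean absolute deviation about the mean latency, which is the quantity treated in [25]; the function name and the latency values are hypothetical and are not taken from the evaluation.

```python
from statistics import fmean
from typing import Sequence


def mean_absolute_deviation(latencies: Sequence[float]) -> float:
    """Average distance of each packet latency from the mean latency."""
    mu = fmean(latencies)
    return fmean([abs(x - mu) for x in latencies])


# Hypothetical packet latencies in cycles; both traces have a mean of 19 cycles.
steady = [18, 19, 19, 20]  # low jitter
bursty = [10, 12, 14, 40]  # high jitter

print(mean_absolute_deviation(steady))  # 0.5
print(mean_absolute_deviation(bursty))  # 10.5
```

The two traces have the same average latency, so only a dispersion metric such as MAD separates them, which is why jitter is reported separately from performance.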
Figures 14 and 15 show another QoS metric, which measures the dispersion of packet latencies using the Coefficient of
Variation ($\mathrm{CoV} = \sigma/\mu$, where $\sigma$ and $\mu$ are the standard deviation and mean of the packet latency). Figure 14 shows the lifetime CoV of
all schemes. On average, our FRmax and LRmax schemes
improve the CoV by 7% and 16%, respectively. The biggest
improvement is from ferret, whose CoV is reduced by 70-75% when
using our schemes. Three programs (canneal, swaptions, vips) show
degradation of QoS for the FR schemes due to congestion. However, the LR schemes are able to dynamically reroute traffic
and improve the QoS.
Figure 15 is a time-lapse graph of blackscholes showing how
VICIS provides a functional platform with less QoS when
routers start to fail. From year 3 to 8, VICIS has higher CoV
compared to our schemes because it uses bare bone routers
and takes a longer time to route flits in the network. In contrast, our schemes degrade gradually and only start to worsen
their QoS towards the end of their lifetime (years 8.4-10).
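
The CoV computation itself is short. The sketch below assumes the population standard deviation (the text does not say whether the population or sample form is used), and the function name and latency values are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Sequence


def coefficient_of_variation(latencies: Sequence[float]) -> float:
    """CoV = sigma / mu, using the population standard deviation (assumed)."""
    mu = fmean(latencies)
    if mu == 0:
        raise ValueError("CoV is undefined when the mean latency is zero")
    return pstdev(latencies) / mu


# Hypothetical latencies in cycles: mean 10, standard deviation 2.
print(coefficient_of_variation([8, 12]))  # 0.2
```

Because CoV is normalized by the mean, it allows the dispersion of latencies to be compared across benchmarks whose average latencies differ widely.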

[Figure 13: QoS:Jitter (Lower is better). Jitter in cycles per benchmark (blackscholes, bodytrack, canneal, ferret, fluidanimate, swaptions, vips, x264) for VICIS, FRavg, FRmax, LRavg and LRmax.]

[Figure 14: QoS:CoV (Lower is better). CoV per benchmark, plus the average, for VICIS, FRavg, FRmax, LRavg and LRmax.]

[Figure 15: QoS:Blackscholes CoV (Lower is better). CoV versus year (0 to 10) for VICIS, FRavg and LRavg.]

5.3 Performance
Figure 16 shows the performance measured as the average
flit latency. Our schemes perform better than VICIS by extending the fault-free execution of NoC routers. In contrast,
VICIS utilizes bare bone routers with lower performance as
soon as it encounters a defect in the crossbar switch. The
best improvement is on benchmark vips, which is 71% using
the LRavg scheme. VICIS’s performance on swaptions and
vips is more than 2× slower. Upon a closer look, we find
that these performance penalties stem from particular traffic
patterns in these benchmarks, which overly exercise routing
paths with routers operating at a reduced functionality (bare
bone version) and lacking the ability to route flits in an efficient manner. Other benchmarks show fairly constant traffic throughout the simulation. In all programs, LR performs
better compared to FR by dynamically avoiding congested
paths during high traffic. On average, FR and LR outperform VICIS by 39% and 43%, respectively.

[Figure 16: Latency (Lower is better). Average latency per benchmark, plus the average, for VICIS, FRavg, FRmax, LRavg and LRmax.]

6. RELATED WORK
Emerging systems, with hundreds of billions of transistors,
are likely to have many faults, even at the point of tape-out
[26]. These faults can impact both processing cores as well
as NoC components, drastically reducing their functionality.
Many recent NoC works target these faulty components, and
outline a plethora of techniques to tolerate faults and sustain
successful communication between two nodes [27–34].
In the realm of exascale computing, most research on exascale NoCs has revolved around performance, energy efficiency, QoS and scalability. Moscibroda et al. explore the
use of bufferless routers for a scalable and energy-efficient
NoC network [35]. Abeyratne et al. explore scalable, asymmetrical high-radix NoC topologies for use in kilo-core
systems [36]. Grot et al. study scalable QoS incorporation
in exascale systems [2]. However, system resiliency is also
fast becoming a first-class design constraint as we continue to
transcend Moore’s law. Bhardwaj et al. explore aging aware
routing in NoCs, but do not consider exascale systems and
QoS implications [16, 37]. Similarly, several recent works explore fault-tolerant routing in small-medium scale NoCs, but
do not consider exascale NoCs. One of the most prominent
works is VICIS by Fick et al., which provides a complete infrastructure for network recovery in the face of component failures [4]. To the best of our knowledge, ours is the first work
to clearly uncover the reliability tension in exascale NoCs,
caused by scaling and supporting QoS.

7. CONCLUSION
This paper explores the tradeoffs of increasing system reliability while enforcing QoS guarantees in an NoC. We introduce two novel DWRR algorithms, FR and LR. Our proposed
algorithms are guided by a low-cost NHM and a broadcast-based profiling and configuration. Our proposed schemes
boost the NoC MTTF on top of VICIS by 25%. Throughout an NoC’s lifetime, we also demonstrate an average of
16% QoS improvement by prolonging the lifetime of the high
performance routers through efficient redistribution of QoS-induced traffic stress across the network.

Acknowledgments
We thank our anonymous reviewers for their helpful suggestions on improving the paper. This work was supported in
part by National Science Foundation grants (CNS-1117425,
CAREER-1253024, CCF-1318826, CNS-1421022, CNS-1421068)
and a donation from the Micron Foundation. Any opinions,
findings, and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily
reflect the views of the NSF.

8. REFERENCES
[1] J. Wawrzynek, D. Patterson, M. Oskin, S.-L. Lu,
C. Kozyrakis, J. Hoe, D. Chiou, and K. Asanovic,
“Ramp: Research accelerator for multiple processors,”
Proc. of MICRO, pp. 46–57, 2007.
[2] B. Grot, S. W. Keckler, and O. Mutlu, “Preemptive
virtual clock: a flexible, efficient, and cost-effective qos
scheme for networks-on-chip,” in Proc. of MICRO,
pp. 268–279, 2009.

[3] J. Lee, M. C. Ng, and K. Asanovic,
“Globally-synchronized frames for guaranteed
quality-of-service in on-chip networks,” in Proc. of
ISCA, pp. 89–100, 2008.
[4] D. Fick, A. DeOrio, J. Hu, V. Bertacco, D. Blaauw, and
D. Sylvester, “Vicis: a reliable network for unreliable
silicon,” in Proc. of DAC, pp. 812–817, 2009.
[5] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, “Lifetime
reliability: toward an architectural solution,” Proc. of
MICRO, pp. 70–80, 2005.
[6] W. Wang, V. Reddy, A. Krishnan, R. Vattikonda,
S. Krishnan, and Y. Cao, “Compact modeling and
simulation of circuit reliability for 65-nm cmos
technology,” IEEE Trans. on Device and Materials
Reliability, vol. 7, no. 4, pp. 509 –517, 2007.
[7] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M.
Swift, “Efficient virtual memory for big memory
servers,” in Proc. of ISCA, pp. 237–248, 2013.
[8] P. Lotfi-Kamran, B. Grot, and B. Falsafi, “Noc-out:
Microarchitecting a scale-out processor,” in Proc. of
MICRO, pp. 177–187, 2012.
[9] D. Fick, N. Liu, Z. Foo, M. Fojtik, J. sun Seo,
D. Sylvester, and D. Blaauw, “In situ delay-slack
monitor for high-performance processors using an
all-digital self-calibrating 5ps resolution time-to-digital
converter,” in ISSCC, pp. 188–189, 2010.
[10] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan,
K. Lai, D. Bull, and D. Blaauw, “RazorII: In situ error
detection and correction for PVT and SER tolerance,”
JSSC, vol. 44, pp. 32–48, Jan. 2009.
[11] Open Source NoC Router RTL.
https://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/Resources/Router.
[12] W. J. Dally and B. Towles, Principles and practices of
interconnection networks. Morgan Kaufmann, 2004.
[13] P. Gratz, B. Grot, and S. W. Keckler, “Regional
congestion awareness for load balance in
networks-on-chip,” in HPCA, pp. 203–214, 2008.
[14] M. Ramakrishna, P. V. Gratz, and A. Sprintson, “Gca:
Global congestion awareness for load balance in
networks-on-chip,” in NOCS, pp. 1–8, IEEE, 2013.
[15] W. Zhao and Y. Cao, Predictive Technology Model.
http://ptm.asu.edu/.
[16] K. Bhardwaj, K. Chakraborty, and S. Roy, “Towards
graceful aging degradation in nocs through an adaptive
routing algorithm,” in Proc. of DAC, pp. 382–391, 2012.
[17] C. Sun, C.-H. Chen, G. Kurian, L. Wei, J. Miller,
A. Agarwal, L.-S. Peh, and V. Stojanovic, “Dsent - a tool
connecting emerging photonics with electronics for
opto-electronic networks-on-chip modeling,” in NOCS,
pp. 201–210, 2012.
[18] P. Kundu, “On-die interconnects for next generation
cmps,” in Proc. of WOCIN, 2006.
[19] D. M. Ancajas, K. Chakraborty, and S. Roy, “Hci
tolerant noc router micro-architecture,” in Proc. of DAC,
no. 40, 2013.
[20] D. Lorenz, M. Barke, and U. Schlichtmann, “Aging
analysis at gate and macro cell level,” in Proc. of
ICCAD, pp. 77–84, 2010.
[21] M. Kamal, M. P. Qing Xie, A. Afzali-Kusha, and
S. Safari, “An efficient reliability simulation flow for
evaluating the hot carrier injection effect in cmos vlsi
circuits,” in ICCD, pp. 352–357, 2012.
[22] Seyab and S. Hamdioui, “Nbti modeling in the
framework of temperature variation,” in Proc. of DATE,
pp. 283 –286, 2010.
[23] J. Hestness, B. Grot, and S. W. Keckler, “Netrace:
dependency-driven trace-based network-on-chip
simulation,” in Proc. of WNOCA, pp. 31–36, 2010.
[24] W. Huang, M. R. Stan, K. Skadron,
K. Sankaranarayanan, S. Ghosh, and S. Velusamy,
“Compact thermal modeling for temperature-aware
design,” in Proc. of DAC, pp. 878–883, 2004.
[25] E. A. H. El Amir, “On uses of mean absolute deviation:
decomposition, skewness and correlation coefficients,”
Metron, vol. 70, no. 2-3, pp. 145–164, 2012.
[26] S. Borkar, “Thousand core chips: a technology
perspective,” in Proc. of DAC, pp. 746–749, 2007.
[27] C.-L. Chou and R. Marculescu, “Farm: Fault-aware
resource management in noc-based multiprocessor
platforms,” in Proc. of DATE, pp. 673–678, 2011.
[28] A. Hosseini, T. Ragheb, and Y. Massoud, “A
fault-aware dynamic routing algorithm for on-chip
networks,” in Proc. of ISCAS, pp. 2653–2656, 2008.
[29] F. Chaix, D. Avresky, N.-E. Zergainoh, and
M. Nicolaidis, “A fault-tolerant deadlock-free adaptive
routing for on chip interconnects,” in Proc. of DATE,
pp. 909–912, 2011.
[30] Y.-C. Lan, M. Chen, W.-D. Chen, S.-J. Chen, and Y.-H.
Hu, “Performance-energy tradeoffs in reliable nocs,” in
Proc. of ISQED, pp. 141–146, 2009.
[31] R. Parikh and V. Bertacco, “Formally enhanced runtime
verification to ensure noc functional correctness,” in
Proc. of MICRO, pp. 410–419, 2011.
[32] A. Prodromou, A. Panteli, C. Nicopoulos, and
Y. Sazeides, “Nocalert: An on-line and real-time fault
detection mechanism for network-on-chip
architectures,” in Proc. of MICRO, pp. 60–71, 2012.
[33] Z. Zhang, A. Greiner, and S. Taktak, “A reconfigurable
routing algorithm for a fault-tolerant 2d-mesh
network-on-chip,” in Proc. of DAC, pp. 441–446, 2008.
[34] W.-C. Tsai, D.-Y. Zheng, S.-J. Chen, and Y. H. Hu, “A
fault-tolerant noc scheme using bidirectional channel,”
in Proc. of DAC, pp. 918–923, 2011.
[35] T. Moscibroda and O. Mutlu, “A case for bufferless
routing in on-chip networks,” in Proc. of ISCA,
pp. 196–207, 2009.
[36] N. Abeyratne, R. Das, Q. Li, K. Sewell, B. Giridhar,
R. Dreslinski, D. Blaauw, and T. Mudge, “Scaling
towards kilo-core processors with asymmetric high
radix topologies,” in HPCA, pp. 89–101, 2013.
[37] K. Bhardwaj, K. Chakraborty, and S. Roy, “An milp
based aging aware routing algorithm for nocs,” in
Proc. of DATE, pp. 326–331, 2012.


