A low-overhead soft-hard fault-tolerant architecture, design and
  management scheme for reliable high-performance many-core 3D-NoC systems by Dang, Khanh N et al.
Journal of Supercomputing manuscript No.
(will be inserted by the editor)
Khanh N. Dang · Michael Meyer · Yuichi Okuyama · Abderazek Ben Abdallah
The final publication is available at Springer via https://doi.org/10.1007/s11227-016-1951-0
Abstract The Network-on-Chip (NoC) paradigm has been
proposed as a favorable solution to handle the strict commu-
nication requirements between the increasingly large num-
ber of cores on a single chip. However, NoC systems are
exposed to the aggressive scaling down of transistors, low
operating voltages, and high integration and power densi-
ties, making them vulnerable to permanent (hard) faults and
transient (soft) errors. A hard fault in a NoC can lead to ex-
ternal blocking, causing congestion across the whole net-
work. A soft error is more challenging because of its silent
data corruption, which leads to a large area of erroneous data
due to error propagation, packet re-transmission, and dead-
lock. In this paper, we present the architecture and design of
a comprehensive soft error and hard fault tolerant 3D-NoC
system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-
NoC (3D-FETO)1. With the aid of efficient mechanisms and
algorithms, 3D-FETO is capable of detecting and recovering
from soft errors which occur in the routing pipeline stages
and leverages reconfigurable components to handle perma-
nent faults in links, input buffers, and crossbars. In-depth
evaluation results show that the 3D-FETO system is able to
work around different kinds of hard faults and soft errors,
ensuring graceful performance degradation, while minimiz-
ing additional hardware complexity and remaining power-
efficient.
Khanh N. Dang · Michael Meyer · Yuichi Okuyama · Abderazek Ben
Abdallah
Adaptive Systems Laboratory
Graduate School of Computer Science and Engineering
The University of Aizu
Aizu-Wakamatsu City, Fukushima 965-8580, Japan
E-mail: d8162103, benab@u-aizu.ac.jp
1 This project is partially supported by Competitive Research Fund-
ing (CRF), The University of Aizu, Reference P-11 (2016), and JSPS
KAKENHI Grant Number JP30453020
Keywords 3D NoCs · Fault-tolerance · Soft-Hard Faults ·
Reliability · Architecture · Design
1 Introduction
Global interconnects are becoming the principal performance
bottleneck for high performance Systems-on-Chip (SoCs) [2].
3-dimensional Networks-on-Chip (3D-NoCs) have been proposed
as a promising architecture that combines the high
parallelism of the Network-on-Chip paradigm with the high
performance and lower interconnect power of 3-dimensional
integrated circuits (3D-ICs) [6]. In the past few years, this
fusion of 3D-ICs and mesh-based NoCs has opened a new
horizon for IC design. The parallelism of NoCs can be
enhanced in the third dimension thanks to the short wire
length and low power consumption of the interconnects of
3D-ICs. As a result, the 3D-NoC paradigm is considered one
of the most advanced and promising architectures for the
future of IC design, as it is capable of providing extremely
high bandwidth and low-power interconnects.
While the NoC paradigm has been increasing in popu-
larity with several commercial chips [3], it is threatened by
the decreasing reliability of aggressively scaled transistors.
Transistors are approaching the fundamental limits of scal-
ing. Gate widths are nearing the molecular scale, resulting in
breakdown and wear out in end products [19,23]. Moreover,
the anticipated fabrication geometry in 2018 scales down to
8nm with a projected 0.6V supply voltage [22]. In the 8nm
process, a higher rate of soft errors affects the control logic
and buffers of NoC routers, leading to chip failure. In addition,
the low supply voltage enforces a very narrow noise mar-
gin, which makes the architecture vulnerable and sensitive
to faults. As reported in [16], the soft error rate increases
arXiv:2003.11018v1 [cs.AR] 21 Mar 2020
Errors in Network-on-Chip
– Soft Errors
  – Transient faults: cross-talk; radiation particles; cosmic rays; thermal neutrons; noise
– Hard Faults
  – Run-time issues: Time-Dependent Dielectric Breakdown; Electromigration; Thermal Stress; Negative-Bias Temperature Instability
  – Manufacturing defects: open; stuck-at-0; stuck-at-1; bridge
Fig. 1 Taxonomy of errors and faults in NoCs.
about 30% for each 100 mV decrease in the supply volt-
age. With rising power density and non-ideal threshold and
supply voltage scaling, soft errors have become increasingly
common during a chip’s lifetime [17]. Figure 1 shows a de-
tailed taxonomy of different types of error and fault sources
in NoCs. We categorized the faults into two classes: Hard
Faults and Soft Errors.
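As a worked illustration of the figure quoted from [16], the relation can be written as a simple scaling rule. This is only a sketch: it assumes that the 30% increase compounds with every 100 mV step, and the symbols $\mathrm{SER}_0$ (soft error rate at a reference voltage) and $V_0$ (that reference voltage) are introduced here for illustration and do not come from [16].

```latex
\mathrm{SER}(V) \;\approx\; \mathrm{SER}_0 \cdot 1.3^{\,(V_0 - V)/(100\,\mathrm{mV})}
% Example with a hypothetical reference voltage V_0 = 1.0 V:
% lowering the supply to V = 0.6 V is four 100 mV steps, so
% SER(0.6 V) \approx SER_0 \cdot 1.3^{4} \approx 2.86 \, SER_0 .
```

Under this assumed model, a 0.4 V reduction in supply voltage would nearly triple the soft error rate.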
Hard faults, including both permanent faults and intermittent
faults, can occur during the manufacturing stage or
under specific operating circumstances. Intermittent faults
occur periodically during operation and can disappear after
a certain time. Because these faults do not permanently
damage a given component, the component can pass through
several testing stages and still cause operation failures.
Although intermittent faults can disappear after a specific
period of time, they can be treated as permanent faults
to avoid complex situations. For both permanent and
intermittent faults, the most natural solution is to use
redundant components [15,12].
Soft errors arise from energetic particles, such as alpha
particles and neutrons from cosmic rays, generating
electron-hole pairs as they pass through a device. A sufficient
amount of accumulated charge may invert the state of a logic
device, such as a latch, gate, or SRAM cell, thereby
introducing a logic fault into the NoC’s operation. Soft errors
do not permanently damage the gate and only occur over a
short period of time. Because of these characteristics, they are
unpredictable and unavoidable. Unlike permanent and
intermittent faults, transient faults cannot be fixed by replacing
the affected components. Instead, the system can recover from
them by repeating the erroneous operation. A transient failure
inside the data path can also be fixed by using code-based
techniques (e.g., Error Correction Code (ECC) [8]). Statistically,
transient faults are the most common kind of fault, accounting
for 80% of failures, as reported in [24]. Therefore, without
an efficient protection mechanism, these errors can
compromise the system’s functionality and reliability.
Hard fault handling schemes are based on two main
approaches: (a) fault-tolerant routing algorithms, which
enable packets to avoid faulty nodes in the network [6,15];
and (b) architecture-based methods, which use hardware
(component) redundancy and/or reconfiguration to recover
from faults [15,12,1]. Soft error recovery also follows two
main schemes: (a) data corruption handling using Error
Correction Code (ECC) based methods [26,8,38]; and (b) control
logic handling using temporal redundancy based methods
[18,39,14].
Although many researchers have proposed solutions for
various individual aspects of on-chip reliability, a compre-
hensive approach encompassing both soft errors and hard
faults pertaining to NoC reliability has yet to evolve. In addi-
tion, the error detection and diagnosis in NoC architectures
has been studied thoroughly in the scope of offline testing;
however, with soft errors and intermittent faults becoming a
dominant failure mode in modern NoCs and general VLSI
systems, a widespread deployment of online test approaches
has become crucial. In this paper, we present a comprehen-
sive soft error and hard fault tolerant 3D-NoC architecture,
named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-
FETO). With the aid of efficient mechanisms and algorithms,
3D-FETO is capable of detecting and recovering from soft
errors occurring in the routing pipeline stages and leverages
reconfigurable components to handle permanent fault occur-
rences in links, input-buffers, and crossbars. The main con-
tributions of this work are summarized as follows:
– A new adaptive 3D router architecture based on a robust
hardware reconfiguration mechanism of the most sus-
ceptible components to hardware faults, and on a low-
cost method that is capable of detecting and recovering
from soft errors in the router pipeline stages.
– An efficient scheme for online control fault detection and
diagnosis in 3D-NoC systems.
The organization of this paper is as follows: in Section 2,
we present related works. Section 3 presents the adaptive
router architecture (SHER-3DR). In Section 4, we present
comprehensive techniques which include fault detection, di-
agnosis and recovery. Section 5 provides the implementa-
tion and evaluation results. Finally, we present the conclu-
sion and our ideas for future work in the last section.
2 Related Works
Many works have addressed the fault-tolerance and
reliability issues in NoC architectures. In [6,1,7], we covered
some well-known solutions presented to tackle hard faults;
therefore, in this section we mainly focus on solutions re-
lated to soft error recovery. As depicted in Table 1, they are
classified into methods focusing on the Data Path (DP) and
methods focusing on the Control Logic (CL) of the router.
For soft errors in the data path, most works use code-based
techniques that not only check the integrity of the
Table 1 Taxonomy of different error recovery protocols and architectures in NoCs.
Fault Type  | Position/Method              | Fault-Tolerant Method
Soft Errors | Data Path                    | Automatic Re-transmission Request [26]
            |                              | Error Detecting/Correcting Code [8,38]
            | Control Logic                | Logic/Latch Hardening [18,32]
            |                              | Pipeline Redundancy [14]
            |                              | Monitoring and Correcting model [39,31,29]
Hard Faults | Routing Technique            | Spare wire [25,35]
            |                              | Split transmission [20]
            |                              | Fault-Tolerant routing algorithm [6,15]
            | Architecture-based Technique | Hardware Redundancy [12]
            |                              | Reconfiguration architectures [15]
received data, but also provide a correction function up to
an acceptable number of faults. For instance, Bertozzi et
al. [8] analyzed several low cost coding techniques for on-
chip communication. Among these techniques, SECDED (Single-
Error Correcting and Double-Error Detecting) was found to
be the solution with the most balanced trade-off between re-
liability and implementation cost. Although the authors pro-
vide several evaluations of energy and hardware complexity,
on-chip communication analysis (such as throughput and
latency) is missing. As an adaptive solution, Yu et al. [38]
presented a dynamic ECC based on quality of wire connec-
tion by using a configurable ECC with two Hamming codes
to adapt with several probabilities of faults. Although this
adaptive ECC obtains energy efficiency, its area overhead is
problematic.
Soft errors can be detected and recovered using temporal
redundancy. For example, Ernst et al. [18] presented a Razor
D Flip-flop with an additional shadow latch sampled by a de-
layed clock for checking the occurrence of transient faults.
Furthermore, a soft error detection solution based on redun-
dant latches was also presented by Ravindan et al. [34]. Al-
though these techniques obtain more efficient detection re-
sults, they nearly double the area overhead and power con-
sumption to maintain the redundant latches.
For soft errors in the control logic, there are several
techniques with cross-layer resolution. At the end-to-end level,
Shamshiri et al. [36] proposed error correction and on-line
diagnosis using a specific code named 2G4L. Based on the
position of the erroneous bit in the received data, the
system can indicate the position of the faulty node in the
network; however, when a packet is misrouted due to wrong
routing information/arbitration, or when an adaptive routing
algorithm is used, the path of a packet is not fixed and the
faulty node cannot be determined. To ensure arbitration
computation across layers, NoCAlert [31] implements constraints
to obtain computational accuracy. By constraining the
relationship between the input and output of a block, the system
can detect both soft and hard faults. Although this work presents
efficient detection, it lacks efficiency in recovering from soft
errors. First, the system needs to distinguish between soft
and hard faults to decide the recovery method. Second, soft
errors cannot be recovered by spatial redundancy, and their
recovery at the end-to-end level is inefficient. The FoReVer
framework [29] also presented a network-level method to
detect and recover from routing errors: lost, duplicated, and
misrouted packets. Since FoReVer is based on end-to-end
detection and recovery, dealing with soft errors requires
retransmission of the whole packet instead of an online
recovery.
In the physical/data-link layers, one of the most com-
mon methods is using Triple Modular Redundancy (TMR).
By triplicating the original module, the system gets three
results at the same time [32]. The three results are sent to
a Majority Voting module to decide the accurate result. Al-
though this technique suffers from high area overhead and
power consumption (about 300%), it is easy to implement
and effective for both soft errors and hard faults. In [39],
the authors deploy a monitoring system on important con-
trol modules. They can diagnose the output to find the fail-
ure. This technique is light-weight in both area and power
and has an insignificant impact on the system performance.
However, it suffers from lack of flexibility since the moni-
tor module has to be specifically designed depending on the
target component. If any changes in the routing algorithm or
pipeline stages are needed, investigation and re-designing of
the monitor module is mandatory.
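The majority-voting step of TMR described above can be sketched in a few lines. This is a minimal software illustration, not the hardware voter of [32]; the function name `majority_vote` and the use of integers as bit vectors are assumptions made for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant results.

    Each output bit takes the value held by at least two of the
    three inputs, so a fault confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


# Three copies of the same module output; the third copy has one bit flipped.
copies = [0b1011, 0b1011, 0b0011]
voted = majority_vote(*copies)
print(bin(voted))  # prints 0b1011
```

The voter masks any fault confined to a single copy, which is why the scheme works for both soft errors and hard faults, at the cost of triplicating the protected module.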
3 Adaptive 3D Router Architecture (SHER-3DR)
Figure 2 shows the block diagram of the proposed adaptive
3D router architecture (SHER-3DR). The router relies on
simple recovery techniques based on system reconfiguration
with redundant structural resources to contain hard faults in
the input-buffers, crossbar, and links, in addition to soft er-
rors in the routing pipeline stages.
The SHER-3DR router is the backbone component of
the 3D-FETO system. Each router has at most seven input
and seven output ports: six input/output ports are dedicated
to the connections with the neighboring routers, and one
input/output port connects the switch to the local
[Figure 2 block diagram; recoverable labels: input ports (local, north, east, south, west, up, down) with ECC, input buffer (RAB), ARQ buffer, NPC and input port manager; Switch Allocator with Arbiter, Stall/Go Controller, SER-manager and Monitor; Crossbar with BYPASS LINK - 1 and BYPASS LINK - 2; fault_manager; signals request, arq_in, arq_out, stop_in, stop_out, data_out.]
Fig. 2 Adaptive 3D router (SHER-3DR) architecture.
computation tile. As shown in Fig. 2, the SHER-3DR contains
seven Input-port modules, one for each direction, in addition
to the Switch-Allocator and the Crossbar module, which
handles the transfer of flits to the next node. An Input-port
module is composed of two main elements: an Input-buffer
and the LAFT routing (Next-Port-Computing) module. In-
coming flits from different neighboring routers, or from the
connected computation tile, are first stored in the Input-buffer.
This step is considered to be the first pipeline stage of the
flit’s life-cycle, Buffer-Writing (BW). After receiving and
storing the flits, their routing information is read and pro-
cessed by a LAFT-Routing module (Next-Port-Computing)
and an arbitrating module (Switch-Allocator). This step is
the second stage - Next-Port-Computing/Switch-Allocator
(NPC/SA). After the NPC/SA pipeline stage, the next-port
value is merged into the flit and the grant signal allows the
flit to traverse from its input port to an output port (Crossbar-
Traversal (CT) stage).
An augmented Look-Ahead-Fault-Tolerant routing algo-
rithm (LAFT) [4,5] is used to perform the routing decision.
If a given flit is routed to the local port, there is no rout-
ing calculation. If the flit is to be routed to another node,
the fault link information of all neighboring nodes is read
by each input-port and LAFT routing is executed. The first
phase of the algorithm is calculating the next node’s address
and its fault output information. In the next phase, the LAFT
routing algorithm determines the minimal paths which are
valid for routing after eliminating the faulty paths. The final
routing path is selected by evaluating two factors of all the
possible routing paths: (1) the diversity of the routing path
to the destination node and (2) the congestion value of the
connection. If there is no minimal routing path, a similar ap-
proach is applied for the non-minimal routing paths. Finally,
an output port of the selected routing is calculated. This in-
formation is merged in the flit as next-output-port bits for
routing in following nodes [1].
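The selection step described above can be sketched as follows. This is a simplified illustration, not the LAFT implementation of [4,5]: the `Candidate` record, its field names, and the rule that diversity is compared first with congestion as the tie-breaker are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One possible next hop (hypothetical record for illustration)."""
    port: str         # output port name, e.g. "north"
    minimal: bool     # True if the hop lies on a minimal path
    faulty: bool      # True if the link is marked faulty
    diversity: int    # path diversity toward the destination (higher is better)
    congestion: int   # congestion value of the connection (lower is better)


def select_output_port(candidates: Sequence[Candidate]) -> Optional[str]:
    """Pick a port: fault-free minimal paths first, then non-minimal ones."""
    healthy = [c for c in candidates if not c.faulty]
    for want_minimal in (True, False):
        pool = [c for c in healthy if c.minimal == want_minimal]
        if pool:
            # Assumed ordering: maximize diversity, then minimize congestion.
            best = max(pool, key=lambda c: (c.diversity, -c.congestion))
            return best.port
    return None  # no usable path from this node


hops = [
    Candidate("east", minimal=True, faulty=True, diversity=3, congestion=0),
    Candidate("north", minimal=True, faulty=False, diversity=2, congestion=3),
    Candidate("up", minimal=True, faulty=False, diversity=2, congestion=1),
    Candidate("west", minimal=False, faulty=False, diversity=1, congestion=0),
]
print(select_output_port(hops))  # prints up
```

In the example the faulty east link is eliminated first; north and up tie on diversity, so the less congested up port is chosen.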
3.1 Hard Fault Recovery Mechanism Overview
The block diagram of the hard fault recovery mechanism
is shown in Fig. 3. The Random Access Buffer mechanism
(RAB) [1] solves the deadlock problem that can occur with
the look-ahead fault-tolerant routing algorithm (LAFT), and
is able to recover from transient, intermittent, and permanent
faults in the input-buffer. When a fault is detected in
one of the slots, the main controller (located in the input port
manager in Fig. 2) takes the flagged slots into account when
assigning the write and read addresses. It keeps checking the
flagged slots to see whether they have recovered from the faults.
The Bypass Link on Demand mechanism (BLoD) [1]
provides additional escape channels whenever the number of
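[Figure 3 diagram; recoverable labels: Fault-control-module (FCM) with control signals Faulty_Cross, Enable_bypass and disable_crss; bypass links Bypass-1, Bypass-2 and Bypass-3; crossbar inputs L_in, N_in, E_in, S_in, W_in, U_in, D_in and outputs L_out, N_out, E_out, S_out, W_out, U_out, D_out. Panels (a) and (b).]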
Fig. 3 Hard-fault tolerant mechanism [1]: (a) Random Access Buffer (RAB); (b) Bypass-Link-on-Demand (BLoD)
faults in the baseline 7x7 crossbar increases. When a fault is
detected in one or several crossbar links, the fault manager
(depicted in Fig. 2) disables the faulty crossbar links and en-
ables the appropriate number of bypass channels. The num-
ber of Bypass-links is very important and it should be min-
imized as much as possible to reduce the area and power
overhead. In a case where the number of faulty links is larger
than the number of backup links, the system needs to mark
the links as faulty and use the LAFT algorithm to avoid rout-
ing through this defective connection.
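The decision between bypassing and routing around faulty crossbar links can be sketched as below. It is a minimal illustration under assumptions: the function name `configure_blod`, the integer link identifiers, and the rule that the lowest-numbered faulty links receive the bypass channels are all hypothetical and are not specified in the text.

```python
from typing import Iterable, List, Tuple


def configure_blod(faulty_links: Iterable[int], num_bypass: int) -> Tuple[List[int], List[int]]:
    """Split faulty crossbar links into those covered by bypass channels
    and those left to the fault-tolerant routing algorithm.

    Returns (bypassed, left_to_routing).
    """
    faulty = sorted(set(faulty_links))
    bypassed = faulty[:num_bypass]          # one bypass channel per faulty link
    left_to_routing = faulty[num_bypass:]   # marked faulty, avoided by LAFT
    return bypassed, left_to_routing


# Three faulty crossbar links but only two bypass links available.
print(configure_blod([3, 1, 5], num_bypass=2))  # prints ([1, 3], [5])
```

The second list is non-empty only when the faults outnumber the bypass links, which is the case where the text falls back to the LAFT algorithm.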
3.2 Soft Error Recovery Mechanism
As represented in Fig. 4, the principal soft-error handling
method in the proposed 3D-FETO system relies on a solu-
tion called Pipeline Computation Redundancy (PCR) in one
more clock cycle [14].
For ease of understanding, we explain the PCR in
Algorithm 1. The Next Port Computing (NPC) and Switch
Allocator (SA) run in parallel (lines 2 and 3) after the Buffer
Writing stage. This is made possible by the LAFT routing
algorithm, which eliminates the dependency between the two
stages. After the first computation, each of the two stages
performs an additional, redundant computation in the next clock
cycle (lines 4 and 5). Comparing the two consecutive results
detects soft errors. If a soft error is detected, the whole
pipeline is halted for correction: a third computation is
performed, and majority voting decides the final result. To
recover from soft errors in the data, Single Error Correction
Double Error Detection (SECDED) [21] with ARQ (Automatic
Retransmission Request) [26] is adopted.
In the first stage, flits are stored in the input buffer at the
Buffer Writing (BW) stage, and the ECC module checks
and corrects the input data. In the second stage, the NPC
and the SA are executed in parallel in the LAFT routing unit
and the Switch-Allocator module. In the third stage, the
Redundant NPC (RNPC) and the Redundant SA (RSA) are
computed in parallel. Then, if the output of RNPC is equal to
that of NPC, and SA is equal to RSA, the Crossbar Traversal
(CT) stage is performed in the third cycle, and the flit goes
to the next router via the output channel. If the RNPC is not
equal to the NPC, the system rolls back and re-computes the
NPC. Likewise, if SA is not equal to RSA, the system rolls
back and re-computes the SA stage. After rolling back and
re-computing, a majority voting module decides the correct
output of these modules. Then, the outputs of NPC/SA are
sent to the Crossbar Traversal stage to finish the flit
transmission.
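The compare, roll-back and vote sequence described above can be sketched in software as follows. This is a behavioral illustration only, not the router's RTL: `pcr_stage` and `make_flaky` are hypothetical helper names, and the handling of the case where all three results differ (the first one is returned) is an assumption, since the text does not cover it.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def pcr_stage(compute: Callable[[], Hashable]) -> Tuple[Hashable, int]:
    """Run one pipeline stage (NPC or SA) with computation redundancy.

    Returns (final_result, number_of_computations).
    """
    first = compute()
    second = compute()              # redundant computation, one extra cycle
    if first == second:
        return first, 2             # no soft error detected
    third = compute()               # roll back and recompute
    # Majority voting over the three results; on a three-way
    # disagreement Counter returns the first result (assumption).
    winner, _ = Counter([first, second, third]).most_common(1)[0]
    return winner, 3


def make_flaky(correct: str, wrong: str, bad_call: int) -> Callable[[], str]:
    """Model a stage whose bad_call-th evaluation is hit by a soft error."""
    calls = {"n": 0}

    def compute() -> str:
        calls["n"] += 1
        return wrong if calls["n"] == bad_call else correct

    return compute


print(pcr_stage(lambda: "east"))                    # prints ('east', 2)
print(pcr_stage(make_flaky("east", "west", 2)))     # prints ('east', 3)
```

The second element of the result mirrors the cycle cost discussed in the text: one extra computation for detection, and one more only when a mismatch forces recovery.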
Figure 5 presents a working demonstration of the SHER-3DR
router. [flit(n)] represents the flit in the nth position of
the packet. [time(m)] illustrates the mth time of computation.
In the first clock cycle, BW handles [flit(1)] while NPC/SA
and CT are idle or are handling another packet. In the second
cycle, NPC/SA computes [flit(1), time(1)], which means the
computation of the first flit for the first time. In the third
cycle, NPC/SA computes [flit(1), time(2)], which means that
it computes the first flit for the second time, also known
as the redundant computation. [c(1)] compares the results
[Figure 4 diagram; recoverable labels: pipeline stages BW, NPC/SA, CT; input ports Local, North, East, West, South, Up, Down with data_in/stop_out and data_out/stop_in signals for each direction (L, N, E, W, S, U, D); input port with ECC, Input Buffer (RAB), NPC, Input port manager, arq_out and request; switch allocation unit with Arbiter, Stall/Go Controller, Soft-Error Monitor, PCR manager, grant and crossbar_ctrl; Crossbar with Tail Sent; tile with PE, NI and router R with UP, DOWN, EAST, WEST, NORTH, SOUTH ports; stacked mesh layers of routers.]
Fig. 4 High-level view of the soft-hard error recovery approach: (a) 3D-Mesh based NoC configuration; (b) Tile organization; (c) SHER-3DR
router organization; (d) Input-Port; (e) Switch allocation unit.
Algorithm 1: Pipeline Computation Redundancy (PCR).
    // input flit’s data
    Input: in_flit
    // output flit’s data
    Output: out_flit
    // Write the flit’s data into the buffers
 1  BufferWriting(in_flit)
    // First computation of NPC and SA
 2  next_port[1] = NextPortComputing(in_flit)
 3  grants[1] = SwitchAllocation(in_flit)
    // Redundant computation of NPC and SA
 4  next_port[2] = NextPortComputing(in_flit)
 5  grants[2] = SwitchAllocation(in_flit)
    // Compare original and redundant results to detect a soft error
    // Soft error on NPC
 6  if (next_port[1] ≠ next_port[2]) then
        // roll back and recalculate NPC
 7      next_port[3] = NextPortComputing(in_flit)
 8      final_next_port = MajorityVoting(next_port[1,2,3])
 9  else
        // No soft error on NPC
10      final_next_port = next_port[1]
11  end
    // Soft error on SA
12  if (grants[1] ≠ grants[2]) then
        // roll back and recalculate SA
13      grants[3] = SwitchAllocation(in_flit)
14      final_grants = MajorityVoting(grants[1,2,3])
15  else
        // No soft error on SA
16      final_grants = grants[1]
17  end
    // After detection and recovery, the algorithm finishes with CT
18  out_flit = CrossbarTraversal(in_flit, final_next_port, final_grants)
Cycle          | BW      | NPC/SA                    | CT
1st            | flit(1) | idle                      | idle
2nd            | flit(2) | [flit(1), time(1)]        | idle
3rd            | flit(3) | [flit(1), time(2)] → c(1) | [flit(1), time(1)]
4th: c(1) = T  | flit(4) | flit(2)                   | idle
4th: c(1) = F  | flit(4) | [flit(1), time(3)] → f(1) | [flit(1), time(2)]
flit(n): nth flit in the packet.
time(m): computation at the mth time.
c(a): comparison for the ath flit. T = True; F = False.
f(a): finalization of the ath flit based on majority voting.
Arrow legend: input direction; first cycle; second cycle; recovery cycle; conditional direction (conditional branches).
Fig. 5 SHER-3DR working demonstration.
of [flit(1), time(1)] and [flit(1), time(2)] to detect the
occurrence of a soft error. If there is no error, CT processes
[flit(1), time(1)] to finish the pipeline stages of the first flit.
If there is an error in NPC/SA, the system requires the
recovery in the fourth cycle. In this cycle, NPC/SA recalculates
the first flit for the third time for recovery ([flit(1), time(3)])
and finalizes an accurate result by using majority voting ([f(1)]).
After getting the final result of the first flit, CT completes
the pipeline stage of the first flit based on the correct
result of the two previous computations: [flit(1), time(1)] or
[flit(1), time(2)]. As shown in Fig. 5, the router requires one
clock cycle for detecting a soft-error and one optional cycle
for recovering each time an error occurs.
4 Light-weight Detection, Diagnosis and Recovery
Mechanism (DDRM)
Algorithm 2 shows the proposed Detection, Diagnosis and
Recovery Mechanism (DDRM). It uses the feedback from
the ECC and the Automatic Retransmission Request (ARQ)
protocol to monitor the errors. As shown in Fig. 2, the input
data is first verified by an ECC decoder. If the value is cor-
rect or the ECC decoder can handle the correction, the flit
is written to the input buffer. Otherwise, a retransmission is
requested. Since the transient fault only occurs over a short
period of time, assumed to be a single clock cycle, it does
not occur for two consecutive cycles. Therefore, ARQ can
recover this kind of fault. However, if a permanent fault oc-
curs, ARQ is unable to correct it and the faulty connection
will keep requesting retransmission infinitely. Therefore, if
the ARQ cannot correct the fault, the system considers it to
be a permanent fault (line 1-10 in Algorithm 2).
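The detection rule described above can be sketched as a small classifier. This is a minimal model, not the router logic: `classify_transfer` and the `send_ok` callback are hypothetical names, and the threshold of two consecutive retransmission requests follows the ARQ counter check in Algorithm 2.

```python
from typing import Callable

PERMANENT_THRESHOLD = 2  # consecutive ARQs that indicate a permanent fault


def classify_transfer(send_ok: Callable[[int], bool],
                      threshold: int = PERMANENT_THRESHOLD) -> str:
    """Classify one flit transfer as 'ok', 'transient' or 'permanent'.

    send_ok(attempt) models the ECC decoder's verdict for the given
    attempt: True if the flit is correct or correctable, False if a
    retransmission has to be requested.
    """
    arq_counter = 0
    attempt = 0
    while arq_counter < threshold:
        if send_ok(attempt):
            return "ok" if arq_counter == 0 else "transient"
        arq_counter += 1   # request a retransmission
        attempt += 1
    return "permanent"     # ARQ could not correct the fault


print(classify_transfer(lambda attempt: True))          # prints ok
print(classify_transfer(lambda attempt: attempt >= 1))  # prints transient
print(classify_transfer(lambda attempt: False))         # prints permanent
```

Bounding the number of retransmissions is what prevents a faulty connection from requesting retransmission indefinitely.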
Since a flit’s correctness is verified by the ECC module
before being written to the buffer, a permanent fault can only
occur in the path between the input-buffer in the upstream
node and the one in the downstream node. Figure 6 shows
the high-level view of the DDRM and Router-to-Router
interfacing. The transmission path of a flit consists of three
main components: input buffer slots, a crossbar link, and a
router-to-router channel. When a fault is detected, DDRM
diagnoses these components to find the fault position and
recovers from it with an appropriate mechanism.
For the diagnosis and recovery phase, the router’s Fault-
manager module initiates the diagnosis with input buffer
checking. In this step, the error statuses of the following
flits of the monitored input buffer are checked. If errors are
detected in the following flits’ transmission, it means the
fault should belong to the crossbar link or the inter-router
channel. The diagnosis is forwarded to check the crossbar
and inter-router channel. If errors are constantly detected
at the same position of the monitored buffer, the fault
belongs to this position. In this case, the Fault-manager
sends a signal to the Random Access Buffer (RAB)
mechanism to mark the slot in the input buffer as faulty
(lines 11-14). If the Fault-manager indicates that the
fault may belong to the crossbar or inter-router channel, the
Fault-manager first configures the Bypass-Link-on-Demand
(previously presented in Section 3.1) to establish an alter-
native connection path. Then, another flit is sent from the
input buffer through a bypass-link and the router-to-router
channel to the downstream node. If, at the downstream node,
the flit is found to be not faulty by the ECC module, the
[Figure 6 diagram; recoverable labels: Upstream Node and Downstream Node, each with Input Port (ECC, ARQ, STOP), Route Computing, Switch Allocator, Crossbar with Bypass-Links, and a Fault-manager; connections to other Routers & IPs; mechanisms Random-Access-Buffer, Bypass-Link-On-Demand, Soft-Error-Resilience Technique and Fault-Tolerant Routing. Diagnosis steps: (1) Input Buffer Checking; (2) Configure Bypass-Link-on-Demand; (3) Fault-Tolerant Routing.]
Fig. 6 Router-to-Router interfacing and DDRM scheme.
Fault-manager concludes that the fault is in the Crossbar,
which is already handled by the BLoD mechanism. There-
fore, the configuration of the BLoD is kept as a recovery.
If the flit is still faulty, the fault belongs to the inter-router
channel. In this situation, the BLoD is released for further
fault-tolerance and the information of the faulty channel is
sent to the routing module (in LAFT algorithm). At the rout-
ing module, the Look-Ahead Fault-Tolerant routing algo-
rithm uses the fault information to handle the channel’s fail-
ure. The flit in the input buffer is re-routed via an alternative
output port.
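The diagnosis order described above (buffer slot, then crossbar, then inter-router channel) can be summarized in a short sketch. The names `FaultSite` and `diagnose`, the boolean inputs, and the action strings are hypothetical and serve only to illustrate the decision flow of the DDRM.

```python
from enum import Enum
from typing import Dict, Tuple


class FaultSite(Enum):
    BUFFER_SLOT = "input buffer slot"
    CROSSBAR = "crossbar link"
    CHANNEL = "inter-router channel"


def diagnose(same_slot_keeps_failing: bool,
             ok_through_bypass: bool) -> Tuple[FaultSite, Dict[str, str]]:
    """Locate a permanent fault and pick the recovery mechanism.

    same_slot_keeps_failing: errors recur at one position of the buffer.
    ok_through_bypass: a flit resent over a bypass link passes the ECC
    check at the downstream node.
    """
    if same_slot_keeps_failing:
        return FaultSite.BUFFER_SLOT, {"RAB": "flag slot"}
    if ok_through_bypass:
        # The bypass link already works around the faulty crossbar link.
        return FaultSite.CROSSBAR, {"BLoD": "keep bypass"}
    # The bypass did not help: free it and let routing avoid the channel.
    return FaultSite.CHANNEL, {"BLoD": "release", "LAFT": "mark channel faulty"}


site, actions = diagnose(same_slot_keeps_failing=False, ok_through_bypass=False)
print(site.value, actions)
```

Checking the buffer first keeps the bypass links free for the cases where they can actually repair the path.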
5 Evaluation Results
5.1 Evaluation Methodology
The proposed 3D-FETO system was designed in Verilog-
HDL, synthesized and prototyped with commercial CAD
tools and VLSI technology, respectively [28,27]. We eval-
uate the hardware complexity of the SHER-3DR router in
terms of area utilization, power consumption (static and dy-
namic), and speed. To evaluate the performance of the pro-
posed system, we select both synthetic and realistic traffic
patterns as benchmarks. For synthetic benchmarks, we se-
lected Transpose [11], Uniform [37], Matrix-multiplication
[10,40], and Hotspot 10% [13]. For realistic benchmarks,
we chose H.264 video encoding system [33], Video Object
Plane Decoder (VOPD), Picture In Picture (PIP) and Multi-
ple Window Display (MWD) [9]. The simulation configura-
tions are depicted in Table 2.
The above synthetic benchmarks help us understand the
performance of the network under stress; however, we also
need several realistic benchmarks to understand the network
under real application traffic. Therefore, we build a simula-
tor in Verilog-HDL which allows us to set up the traffic pat-
terns from real applications. Based on the traffic patterns, the
Network Interfaces send and receive packets over the
networks. We select a video encoding system using an H.264
encoder, an MP3 encoder, and an OFDM [33]. Moreover, we
select three applications [9]: VOPD, PIP and MWD.
We evaluate the performance of our fault-tolerant model
which includes hard fault tolerance from 3D-FTO [1], Soft-
Error Tolerance OASIS system, and the proposed system
(3D-FETO). We measure the average packet latency, with
the selected synthetic and realistic benchmarks. To under-
stand the impact of fault-tolerance techniques on performance,
we compare the obtained results with the baseline 3D-NoC
system presented in [4]. We randomly inject faults at three
fault-rates: 10%, 20% and 33%. The faults are injected into
hard fault tolerant and soft error tolerant modules. For the
soft error tolerant system, only soft errors are injected. For
the hard fault tolerant (3D-FTO) system, only hard faults
are injected. For the final system (3D-FETO), both soft er-
rors and hard faults are injected. Hard faults are injected at
the beginning of simulation and their rate is measured as
the percentage of routers with faults. Soft errors are injected
during the system’s operation and their rate is considered to
be the number of soft errors per clock cycle. The injected
fault rates are considered individually for each error type.
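The injection policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Verilog-HDL test bench used in the paper: the function names are hypothetical, the hard-fault rate is read as the fraction of routers that are faulty from cycle 0, and the soft-error rate is modeled as the probability that one soft error is injected in a given clock cycle.

```python
import random

def pick_faulty_routers(num_routers, hard_fault_rate, rng):
    """Choose the routers that carry a hard fault for the whole run."""
    count = round(num_routers * hard_fault_rate)
    return set(rng.sample(range(num_routers), count))

def soft_error_cycles(num_cycles, soft_error_rate, rng):
    """Return the clock cycles in which one soft error is injected.

    Assumption: the rate is treated as a per-cycle probability.
    """
    return [c for c in range(num_cycles) if rng.random() < soft_error_rate]

rng = random.Random(2016)      # fixed seed, only for repeatability
routers = 4 * 4 * 4            # 4 x 4 x 4 network (Uniform benchmark, Table 2)
hard = pick_faulty_routers(routers, 0.33, rng)
soft = soft_error_cycles(10_000, 0.10, rng)
print(len(hard), "routers with hard faults;", len(soft), "soft errors")
```

The two rates are drawn independently, which mirrors the statement that the injected fault rates are considered individually for each error type.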
5.2 Complexity Evaluation
In this evaluation, we considered the hardware complexity
of the proposed SHER-3DR router. For this evaluation, we
use the NANGATE 45nm technology library [27]. Area cost
and power consumption analyses are performed with the
Synopsys© Design Compiler. The power consumption information
is analyzed based on the switching activity of the
Algorithm 2: Fault Detection, Diagnosis and Recovery.
    // Automatic Retransmission Request
    Input: transmitting_flit
    // Transmitted Buffer Position
    Input: buffer_position
    // Control signal to all Fault-Tolerance modules
    Output: RAB_control, BLoD_control, LAFT_control
    // Transmit the flit, get the ECC’s feedback
    Transmit(transmitting_flit);
    ECC_result = ECC-Decoder(transmitting_flit);
    // DETECTION PHASE:
    if ECC_result == ARQ then
        // Automatic Retransmission Request
        increase(ARQ_counter);
        ARQ(transmitting_flit);
    else
        // The transmitted flit is non-faulty
        Finish;
    end
    // Check the number of consecutive ARQs
    if (ARQ_counter == 2) then
        // There is a permanent fault
        // Jump to DIAGNOSIS-RECOVERY PHASE
    end
    // DIAGNOSIS-RECOVERY PHASE:
    // Start with Input Buffer Checking
    Buffer_Failure ← Buffer_Checking(buffer_position);
    if (Buffer_Failure == Yes) then
        // The Random Access Buffer receives the position to handle.
        RAB_control = buffer_position;
        Finish;
    else
        // The buffer slot is non-faulty.
        // Move to Crossbar Checking: using a Bypass-Link.
        BLoD_control = enable;
        // Get the ECC’s feedback and detect with ARQ_counter.
        if (ARQ_counter == 2) then
            // BLoD cannot fix the fault, the link has failed.
            BLoD_control = release;
            // The LAFT routing algorithm handles the faulty link.
            LAFT_control = faulty;
            Finish;
        else
            // BLoD already fixed the failure, the recovery step is finished.
            Finish;
        end
    end
router under the uniform benchmark. We start first by ob-
serving the additional hardware added to the baseline system
when we employ the hard fault tolerance model (3D-FTO
router). Then, we evaluate the impact when we consider
the soft error tolerant model (Soft Error Tolerant router).
Finally, we evaluate the completed SHER-3DR system in-
cluding both soft and hard fault tolerant mechanisms. The
configurations of the network are shown in Table 2 and the
layout of a single SHER-3DR router is depicted in Fig. 8.
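For readers who prefer executable code, the decision flow of Algorithm 2 can be paraphrased in Python. This is a minimal behavioural sketch, not the hardware implementation: the function names and the list-of-booleans interface are hypothetical, and the ECC decoder, buffer check, and bypass link are replaced by pre-computed outcomes passed in as arguments.

```python
def is_permanent(ecc_results):
    """Return True when two consecutive ARQs are observed.

    ecc_results lists the ECC decoder outcome of successive
    transmissions; True means a retransmission (ARQ) was requested.
    """
    arq_counter = 0
    for needs_arq in ecc_results:
        if not needs_arq:
            return False        # flit delivered correctly: Finish
        arq_counter += 1
        if arq_counter == 2:
            return True         # permanent fault suspected
    return False

def recovery_action(ecc_results, buffer_is_faulty, bypass_ecc_results):
    """Return which mechanism handles the flit: ok, RAB, BLoD or LAFT."""
    if not is_permanent(ecc_results):
        return "ok"             # clean flit or a soft error fixed by ARQ
    if buffer_is_faulty:
        return "RAB"            # remap the faulty input-buffer slot
    if not is_permanent(bypass_ecc_results):
        return "BLoD"           # the bypass link worked around the crossbar
    return "LAFT"               # the link itself failed; reroute around it

print(recovery_action([True, True], False, [True, True]))
```

The order of the checks follows the algorithm: buffer first, then crossbar via the bypass link, and the routing algorithm only as the last resort.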
Table 3 illustrates the hardware complexity results of
SHER-3DR router in terms of area, power (static, dynamic,
and total), and speed. In the hard fault tolerance router (3D-
FTO), the area and power consumption overheads have in-
creased by 1.43% and 25.65%, respectively. The maximum
speed has also slightly decreased. On the other hand, our
soft error handling mechanism adds seven ARQ buffers and
some combinational logic which increase the area and power
consumption more significantly. However, SHER-3DR in-
troduces 7.50% and 3.74% extra area and power consump-
tion, respectively, when compared to the soft error tolerant
model. In comparison to the baseline model, SHER-3DR
increases the area and power consumption by 56.39% and
112.10%, respectively, while the maximum speed decreases
by 33.70%.
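The percentages quoted above follow directly from Table 3. The short Python check below recomputes them from the table values; the dictionary keys and function names are illustrative only.

```python
TABLE3 = {
    # model: (area in um^2, total power in mW, maximum speed in MHz)
    "baseline": (18873, 6.0658, 925.28),
    "3d_fto": (19143, 7.6219, 909.09),
    "soft_error": (27457, 12.4024, 625.00),
    "sher_3dr": (29516, 12.8658, 613.50),
}

def change_percent(new, old):
    """Relative change of new with respect to old, in percent."""
    return (new - old) / old * 100.0

def overhead(model, reference):
    """Return (area, power, speed) changes of model versus reference."""
    return tuple(change_percent(n, o)
                 for n, o in zip(TABLE3[model], TABLE3[reference]))

for model, ref in [("3d_fto", "baseline"),
                   ("sher_3dr", "soft_error"),
                   ("sher_3dr", "baseline")]:
    area, power, speed = overhead(model, ref)
    print(f"{model} vs {ref}: area {area:+.2f}%, "
          f"power {power:+.2f}%, speed {speed:+.2f}%")
```

A negative speed change corresponds to a lower maximum frequency.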
The area cost and power consumption of the proposed
router are given by Equation 1, where π_i represents the area
cost or power consumption of module i. The SHER-3DR
router consists of four main modules: input-ports, switch-
allocator, crossbar, and fault manager.
\[
\pi_{\text{router}} = \pi_{\text{input-ports}} + \pi_{\text{switch-allocator}} + \pi_{\text{crossbar}} + \pi_{\text{fault-manager}} \tag{1}
\]
The details of an input port, a switch-allocator and a crossbar
are given in Equation 2.
\[
\begin{aligned}
\pi_{\text{input-ports}} &= \pi_{\text{original-input-ports}} + \pi_{\text{RAB-controller}} + \pi_{\text{PCR-controller}} + \pi_{\text{ECC}} \\
\pi_{\text{switch-allocator}} &= \pi_{\text{original-switch-allocator}} + \pi_{\text{PCR-monitor}} \\
\pi_{\text{crossbar}} &= \pi_{\text{original-crossbar}} + \pi_{\text{bypass-links}} + \pi_{\text{ARQ-buffers}}
\end{aligned} \tag{2}
\]
We can observe the overheads in power consumption and
area cost that are caused by the fault-tolerance mechanisms
(RAB-controller, PCR-controller, ECC, BLoD, ARQ buffers).
Figure 7 provides the evaluation results of power consump-
tion and area cost of SHER-3DR. In terms of area cost, the
input ports occupy the majority with over 67% which is fol-
lowed by the crossbar (20%) and the switch allocator (9%).
The fault manager, which supports DDRM, uses only about
4% of the overall area cost. In terms of power consumption,
the input ports consume over 80% of the total value. The
fault manager module also causes an insignificant increase
in power consumption (3%).
When compared to the baseline OASIS router, the proposed
SHER-3DR consumes more power and occupies more area.
As shown in Fig. 7, SHER-3DR increases
the area and power of all three main modules (crossbar, input
ports, and switch-allocator). The overhead can be analyzed
by Equation 2, where additional modules are attached
to support the fault-tolerance mechanisms.
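As a worked step, the overhead can be written out explicitly. Assuming (this is not stated in the text) that the baseline router cost is the sum of its three original modules, π_baseline = π_original-input-ports + π_original-switch-allocator + π_original-crossbar, substituting Equation 2 into Equation 1 and subtracting the baseline leaves only the added modules:

\[
\pi_{\text{router}} - \pi_{\text{baseline}} = \pi_{\text{RAB-controller}} + \pi_{\text{PCR-controller}} + \pi_{\text{ECC}} + \pi_{\text{PCR-monitor}} + \pi_{\text{bypass-links}} + \pi_{\text{ARQ-buffers}} + \pi_{\text{fault-manager}}
\]

Under this reading, every term on the right-hand side is a fault-tolerance component, which is why the overhead grows with each mechanism that is enabled.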
Table 2 Simulation configurations.
Parameter                System                          Value
Network Size (x × y × z) Matrix                          6 × 6 × 3
                         Transpose                       4 × 4 × 4
                         Uniform                         4 × 4 × 4
                         Hotspot 10%                     4 × 4 × 4
                         H.264                           3 × 3 × 3
                         VOPD                            3 × 2 × 2
                         MWD                             2 × 2 × 3
                         PIP                             2 × 2 × 2
Total Injected Packets   Matrix                          1,080
                         Transpose                       640
                         Uniform                         8,192
                         Hotspot 10%                     8,192
                         H.264                           8,400
                         VOPD                            3,494
                         MWD                             1,120
                         PIP                             512
Packet Size              Hotspot 10%                     10 flits + 10% for hotspot nodes
                         Others                          10 flits
Flit Size                -                               44 bits
Header Size              -                               14 bits
Payload Bits             Baseline, 3D-FTO                30 bits
                         Soft Error Tolerance, 3D-FETO   18 bits
Parity Bits              Baseline, 3D-FTO                0 bits
                         Soft Error Tolerance, 3D-FETO   12 bits (2 × SECDED(22,16))
Buffer Depth             -                               4
Switching                -                               Wormhole-like
Flow-control             -                               Stop-Go
Routing                  -                               LAFT
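The flit bit budget in Table 2 can be cross-checked with simple arithmetic. The Python sketch below uses hypothetical names and only assumes that each SECDED(n, k) codeword contributes n − k parity bits and that header and payload together form the protected data bits.

```python
FLIT_BITS = 44
HEADER_BITS = 14

def flit_budget(payload_bits, codewords, code_n, code_k):
    """Return (data_bits, parity_bits, total_bits) for one flit."""
    data_bits = HEADER_BITS + payload_bits
    parity_bits = codewords * (code_n - code_k)
    return data_bits, parity_bits, data_bits + parity_bits

# Baseline and 3D-FTO: no ECC, 30 payload bits.
baseline = flit_budget(payload_bits=30, codewords=0, code_n=0, code_k=0)
# Soft Error Tolerance and 3D-FETO: two SECDED(22,16) codewords.
protected = flit_budget(payload_bits=18, codewords=2, code_n=22, code_k=16)
print(baseline, protected)
```

Both configurations fill the 44-bit flit; in the protected case the 32 data bits match the capacity of two 16-bit codewords, at the price of 12 fewer payload bits per flit.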
Table 3 Hardware complexity evaluation and comparison results.
Model                        Area (µm²)  Power (mW)                  Speed (MHz)
                                         Static   Dynamic  Total
Baseline LAFT router         18,873      5.1229   0.9429   6.0658    925.28
3D-FTO router                19,143      6.4280   1.1939   7.6219    909.09
Soft Error Tolerance router  27,457      9.7314   2.6710   12.4024   625.00
SHER-3DR                     29,516      10.0819  2.7839   12.8658   613.50
[Figure: two panels, (a) Area Cost and (b) Power Consumption, comparing 3D-FETO with the Baseline. Vertical axis: normalized value (0 to 1). Components: Crossbar, Input-Port, Switch Allocation, Fault-manager.]
Fig. 7 Area cost and power consumption analysis.
Although our proposed models are penalized in terms
of area, power consumption, and maximum frequency due
to the additional logic and registers that are necessary for the
fault handling mechanisms, they provide improved resiliency
against a significant number of soft errors and hard faults.
[Figure: router layout of 450 µm × 450 µm with the TSV area marked.]
Fig. 8 Layout of a single SHER-3DR router for the 3D-FETO system.
The SHER-3DR router was designed in Verilog-HDL and synthesized
using the 45nm technology library [27]. For the Through Silicon
Via (TSV) integration, we used the FreePDK3D45 kit compiler [28]. The
SHER-3DR router occupies a 450 µm × 450 µm area and the TSV array
contains 208 TSVs.
5.3 Latency Evaluation
In the second experiment, we evaluate the performance of
the proposed architecture in terms of latency over various
benchmark programs and error injection rates for three sys-
tem configurations: (1) Hard-fault tolerant system (3D-FTO),
(2) Soft-error tolerant OASIS system, and (3) Hard-fault and
Soft-error tolerant system (3D-FETO). The simulation re-
sults are shown in Figs. 9 and 10. From these graphs, we
notice that with 0% hard faults (in input buffer and cross-
bar only), 3D-FTO has similar performance to the baseline
system (LAFT-OASIS). In addition, we found that even at a
33% fault-rate, 3D-FTO increases the latency by only 1.71%,
11.38%, 8.79% and 13.73% for Transpose, Uniform, 6 × 6
Matrix, and Hotspot-10%, respectively. With realistic benchmarks,
the performance of 3D-FTO degrades slightly at low
error-rates, but it suffers a larger impact at high error-rates
(20% and 33%) since the flits encounter bottlenecks
due to errors inside the input buffers. However, the proposed
3D-FETO model still works even at high fault-rates while
the baseline model collapses at a 5% error-rate. We used the
same benchmark programs to evaluate the soft error tolerant
model. Since both the proposed Pipeline Computation Re-
dundancy mechanism and ECC require additional clock cy-
cles, we can observe a significant effect on average packet
latency. For the 0%, 10%, 20% and 33% fault-rates, the
Soft Error Tolerant model increases the average delay in
the Transpose benchmark by 18.57%, 28.74%, 34.54% and
49.62%, respectively. Finally, we evaluate the proposed 3D-FETO
system with both soft error and hard fault handling
schemes. As shown in Figs. 9 and 10, 3D-FETO shows
a significant increase in average latency, which
has mostly doubled for both realistic and synthetic benchmarks.
At a 33% fault-rate with the Matrix, Uniform, and Transpose
benchmarks, 3D-FETO’s average packet latency increases by
78.44%, 50.73% and 67.18%, respectively.
The degradation is caused by both the soft error and hard
fault tolerance mechanisms: (1) the ECC+ARQ and PCR
both require additional re-transmission clock cycles; (2) the
RAB and LAFT routing algorithm may disable a part of the
network, which causes congestion. However, the system still
maintains the ability to work under an extremely high fault-rate
(33% for hard faults and 33% for soft errors).
5.4 Throughput Evaluation
Figure 11 depicts the throughput evaluation with the adopted
synthetic benchmarks. At a 0% error rate, 3D-FTO (hard-
fault tolerance) presents the best throughput which matches
the capacity of the baseline LAFT-OASIS. The Soft Error
Tolerant OASIS and the proposed 3D-FETO have lower throughput
due to their soft error tolerance mechanisms. When the
errors are injected into the system, we can observe a degra-
dation in throughput. Thanks to the efficient hard fault tol-
erance scheme and the fault-tolerant routing algorithm, 3D-
FTO at a 33% error-rate provides a slightly decreased through-
put: 40.18%, 43.96%, 43.55% and 32.59% for Transpose,
Matrix, Uniform, and Hotspot 10%, respectively. For the
Soft Error Tolerant OASIS, the system requires re-transmission
via the ARQ mechanism and the re-execution for the soft
error mechanism. Therefore, the throughput is degraded due
to extra clock cycles. The proposed 3D-FETO, which is a
fusion of both hard fault tolerance and soft error tolerant
mechanisms, inherits both degradations; however, these sys-
tems provide the ability to handle up to a 33% error rate (the
limitation of the soft error mechanism).
Table 4 Successful arrival-rate comparison results for a 5×5×4 system
configuration under Uniform traffic.
Algorithm / Fault-rate   1%     5%     10%    15%    20%
XYZ                      91%    62%    41%    28%    23%
Hybrid-XYZ               99%    83%    62%    44%    36%
8-RW                     100%   95%    85%    69%    59%
Odd-Even                 96%    85%    67%    53%    43%
Hybrid-Odd-Even          100%   94%    83%    70%    61%
4N-FIRST                 98%    89%    72%    68%    46%
4NP-FIRST                97%    98%    95%    83%    76%
LAFT-OASIS               100%   100%   99%    98%    95%
3D-FETO                  100%   100%   99%    99%    97%
5.5 Reliability Evaluation
5.5.1 Arrival Rate
This subsection presents the reliability evaluation of the proposed
3D-FETO system over several hard-fault and soft-error
injection rates. For comparison, seven systems adopting
different routing algorithms are selected [30]: XYZ, Hybrid-XYZ,
8-Random-Walk (8-RW), Odd-Even, Hybrid-Odd-Even,
4N-First, and 4NP-First. Among these algorithms, we can
find deterministic 3D routing algorithms, fault-tolerant 2D
algorithms that were extended to the third dimension, and
also turn-model based schemes that were proposed for fault-
tolerant 3D-NoC systems. We adopted the same simulation
environment and assumptions made in [30] from where the
arrival-rate results were also obtained. For fair comparison,
we assume that the faults can occur at any link with LAFT;
thus, we eliminate the two assumptions that are necessary
for the algorithm to work efficiently: (1) the links connecting
the PE to the local input and output ports are always
non-faulty; (2) there exists at least one non-faulty path between
a (source, destination) pair. Moreover, we also evaluate the
arrival-rate of our final system with the enhancements by
Random-Access-Buffer and Bypass-Link-on-Demand. Instead
of only distributing faults on the inter-router channel, they
are randomly assigned to input buffers, crossbar, or the
inter-router channel.
[Figure: four panels, (a) Transpose, (b) Uniform, (c) Matrix, (d) Hotspot. Horizontal axis: Fault Rate (%) at 0%, 10%, 20%, 33%. Vertical axis: Average Latency (cycles/packet). Series: Baseline LAFT-OASIS, Hard Fault Tolerant OASIS, Soft Error Tolerant OASIS, 3D-FETO.]
Fig. 9 Average packet latency evaluation of the synthetic benchmarks.
Table 4 and Table 5 depict the arrival-rate results for a 5×5×4
system (100 nodes) under Uniform and Transpose traffic
patterns, respectively. Due to its lack of support for
fault-tolerance, XYZ routing demonstrates the worst arrival-rate
for both applications. Its variant Hybrid-XYZ shows slightly
better results, but it is still considered unacceptable. Despite
the fact that 8-RW is fault-tolerant, its arrival-rate degrades
considerably as we increase the fault-rate. This can be
explained by the frequent deadlock occurrence with this
algorithm, which is considered to be one of its main drawbacks.
4NP-FIRST is a fault-tolerant routing algorithm targeted for
3D-NoCs. However, it does not scale very well when we
increase the fault-rate. In fact, more than one third (37%) of
the injected packets fail to reach their destinations at a 20%
fault-rate under Transpose traffic, as can be seen in Table 5.
Among the considered algorithms, 3D-FETO appears to
be the most reliable solution, providing a scalable arrival-
rate that does not go under 97% in both applications, even
at a 20% fault-rate. When observing the results with the two
applications, 3D-FETO with the LAFT algorithm is consid-
ered to be the only scheme that takes advantage of long dis-
tance communications in Transpose traffic. This is in con-
trast with the remaining algorithms where their reliability
degrades considerably with this application. In fact, the com-
bination of look-ahead routing and the path prioritization
using the diversity value in LAFT significantly increases the
probability for packets to find non-faulty paths to reach their
destinations.
The arrival rates of the proposed 3D-FETO are at least
97% in the worst case (20% fault-rate), while LAFT-OASIS’s
arrival rates are 95% and 96%. At the other fault-rates, 3D-FETO
maintains high reliability with an arrival-rate
of over 98%. When we analyzed the possible causes for the
failing 5%, we observed the occurrence of cases where all
the connecting links of a given router are faulty: for example,
the East, North, and Up links of the bottom-left router
of the network are broken. Thus, the router cannot receive or
inject any flit from/to the network. Another failure case man-
ifests when the link connecting the router to the attached PE
is faulty. As expected, these two cases justify the two as-
sumptions that we previously made to ensure the efficiency
of LAFT’s fault-tolerance capabilities.
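The first failure case, a router whose inter-router links are all faulty, is easy to detect offline. The Python sketch below is an illustration only: it assumes a plain 3D mesh, treats each link as bidirectional and failing as a whole, and maps East, North, and Up to +x, +y, and +z; none of these modeling choices are taken from the paper.

```python
from itertools import product

def neighbours(node, dims):
    """Yield the mesh neighbours of node in a 3D mesh of size dims."""
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(node)
            nxt[axis] += step
            if 0 <= nxt[axis] < dims[axis]:
                yield tuple(nxt)

def isolated_routers(dims, faulty_links):
    """Return the routers whose inter-router links are all faulty."""
    faulty = {frozenset(link) for link in faulty_links}
    return [node
            for node in product(*(range(d) for d in dims))
            if all(frozenset((node, n)) in faulty
                   for n in neighbours(node, dims))]

dims = (5, 5, 4)                 # the 5 x 5 x 4 system of Tables 4 and 5
corner = (0, 0, 0)               # bottom-left router
broken = [(corner, (1, 0, 0)),   # East link
          (corner, (0, 1, 0)),   # North link
          (corner, (0, 0, 1))]   # Up link
print(isolated_routers(dims, broken))
```

A corner router has only three inter-router links, so it is the cheapest node to isolate, which is consistent with the example given in the text.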
[Figure: four panels, (a) H.264, (b) PIP, (c) MWD, (d) VOPD. Horizontal axis: Fault Rate (%) at 0%, 10%, 20%, 33%. Vertical axis: Average Latency (cycles/packet). Series: Baseline LAFT-OASIS, Hard Fault Tolerant OASIS, Soft Error Tolerant OASIS, 3D-FETO.]
Fig. 10 Average packet latency evaluation of the realistic benchmarks.
Table 5 Successful arrival-rate comparison results for a 5×5×4 system
configuration under Transpose traffic.
Algorithm / Fault-rate   1%     5%     10%    15%    20%
XYZ                      85%    46%    31%    14%    11%
Hybrid-XYZ               99%    68%    42%    25%    20%
8-RW                     93%    82%    62%    44%    36%
Odd-Even                 97%    84%    53%    42%    32%
Hybrid-Odd-Even          99%    92%    77%    62%    53%
4N-FIRST                 96%    86%    68%    50%    37%
4NP-FIRST                100%   97%    89%    75%    63%
LAFT-OASIS               100%   100%   100%   99%    96%
3D-FETO                  100%   100%   100%   99%    98%
5.5.2 Mean Time To Failure Improvement
Besides the arrival rate evaluation, we assessed our fault-tolerant
system in terms of Mean Time To Failure (MTTF)
improvement. We define a system as healthy if it operates
correctly (100% arrival rate, accurate fault detection and
recovery function). Otherwise, the system is marked as failed.
To obtain more precise results, we use the net-list (gate-level)
models from the complexity evaluation. Moreover,
faults are injected not only into the fault-tolerance modules
but also into other modules (controller, management
module). Before the MTTF assessment, we first assume
that the original system has a natural fault rate λ_raw. The
MTTF value is then given as follows.
\[
\mathrm{MTTF}_{\text{raw}} = \frac{1}{\lambda_{\text{raw}}} \tag{3}
\]
[Figure: four panels, (a) Transpose, (b) Uniform, (c) Matrix, (d) Hotspot. Horizontal axis: Fault Rate (%) at 0%, 10%, 20%, 33%. Vertical axis: Throughput (flits/node/cycle). Series: Baseline LAFT-OASIS, Hard Fault Tolerant OASIS, Soft-Error Tolerant OASIS, 3D-FETO.]
Fig. 11 Throughput evaluation of the synthetic benchmarks.
To measure the MTTF value of the fault-tolerant system, we
use a Monte-Carlo based simulation as shown in Figure 12.
At the beginning of the simulation, we define the number of
experiments (N), the fault models, and the distribution
mechanisms. Faults are generated in two types: soft errors
(which occur randomly within a clock period) and hard faults
(which persist from the beginning to the end of the experiment).
There are also two fault models: stuck-at “0” and stuck-at “1”.
Faults are injected into gates that are selected by a random
generator. We use two distributions: (1) flat: faults are randomly
injected into any gate inside a router; (2) weighted: more than 80%
of the faults are injected into the fault-tolerant modules (buffer,
crossbar, next-port-computing, switch-allocator). For each
experiment i, we inject faults and examine the correctness of
the system (data accuracy, fault-tolerance configurations).
Faults are injected until the system is determined to have failed.
At the end of an experiment, the number of faults f_i is
recorded for the final process. To calculate the MTTF value
of a system, the average number of faults is used in the
following equation.
\[
\mathrm{MTTF}_{\text{system}} = \frac{\sum_{i=1}^{N} f_i \times \mathrm{MTTF}_{\text{raw}}}{N} \tag{4}
\]
In order to understand the efficiency of the fault-tolerance,
the ratio of the two MTTF values is used as in Equation 5.
\[
\mathrm{Improvement}_{\text{MTTF}} = \frac{\mathrm{MTTF}_{\text{fault-tolerant}}}{\mathrm{MTTF}_{\text{original}}} \tag{5}
\]
Because the raw fault rate depends on the technology
parameters and the operating conditions, determining it requires a
highly complex evaluation. To alleviate the complexity, we
assume the fault-tolerant and original systems have a similar
raw fault rate. Therefore, the MTTF improvement can
be obtained by Equation 6, where AFTF is the average number of
faults to failure.
\[
\mathrm{Improvement}_{\text{MTTF}} = \frac{\mathrm{AFTF}_{\text{fault-tolerant}}}{\mathrm{AFTF}_{\text{original}}} \tag{6}
\]
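The Monte-Carlo procedure and Equations 4 to 6 can be illustrated with a toy model in Python. This is a sketch under strong assumptions rather than the gate-level simulation of the paper: each injected fault is assumed to break the system independently with a fixed probability p_fail (a hypothetical parameter), and f_i counts the faults absorbed before the one that causes the failure, which is one possible reading of Fig. 12.

```python
import random

def faults_to_failure(p_fail, rng, max_faults=10_000):
    """Count the faults absorbed before the first one that breaks the system."""
    absorbed = 0
    while absorbed < max_faults and rng.random() >= p_fail:
        absorbed += 1
    return absorbed

def average_faults_to_failure(p_fail, experiments, rng):
    """AFTF: mean of f_i over N experiments (Equation 4 without MTTF_raw)."""
    total = sum(faults_to_failure(p_fail, rng) for _ in range(experiments))
    return total / experiments

def mttf_improvement(aftf_tolerant, aftf_original):
    """Equation 6: ratio of the two average faults-to-failure values."""
    return aftf_tolerant / aftf_original

rng = random.Random(1)   # fixed seed, only for repeatability
# The two failure probabilities below are made up for illustration.
aftf_original = average_faults_to_failure(0.30, experiments=1000, rng=rng)
aftf_tolerant = average_faults_to_failure(0.15, experiments=1000, rng=rng)
print(aftf_original, aftf_tolerant,
      mttf_improvement(aftf_tolerant, aftf_original))
```

Because MTTF_raw is assumed equal for both systems, it cancels in the ratio, so only the fault counts need to be simulated.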
Table 6 shows the average number of faults to failure
after 1000 simulations. The test scheme is built to functionally
verify the data communication and the fault-tolerance
mechanisms. In the flat distribution, the proposed SHER-3DR
enhances the MTTF for hard faults and soft errors by
1.93 and 1.49 times, respectively. With the weighted distribution,
the proposal shows more improvement since the faults
are concentrated in the fault-tolerant modules. SHER-3DR’s hard
fault tolerance is 2.96 times better than the baseline OASIS router.
In terms of soft error MTTF, SHER-3DR is 5.32 times better
than the original router. In conclusion, we observe a significant
improvement in terms of MTTF from our proposed
mechanism. Along with the high arrival rates, this demonstrates
the reliability enhancement of our system.
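As a quick check, applying Equation 6 to the values in Table 6 reproduces the reported improvement factors (hard fault flat, hard fault weighted, soft error flat, soft error weighted):

\[
\frac{4.58}{2.37} \approx 1.93, \qquad
\frac{6.085}{2.055} \approx 2.96, \qquad
\frac{26.770}{17.928} \approx 1.49, \qquad
\frac{21.492}{4.037} \approx 5.32
\]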
6 Conclusion and Future Work
In this paper, we proposed a comprehensive fault tolerant
3D-Network-on-Chip (3D-NoC) system architecture for highly-
reliable many-core Systems-on-Chips (SoCs), named 3D-
FETO. The proposed system is based on two approaches.
[Figure: flowchart of the Monte-Carlo simulation. Boxes: define the total number of experiments (N); identify the random parameters of the system; assume appropriate distributions for the parameters; initialize counter i = 1 with f_i = 0; generate a uniformly distributed number for each experiment i; generate the random variable numbers according to the system’s distribution; evaluate using the set of random numbers; determine whether the system is a success or a failure (success: f_i = f_i + 1); check whether i = N (No: i = i + 1; Yes: calculate the system MTTF = (∑ f_i × MTTF_raw) / N).]
Fig. 12 MTTF simulation methodology.
Table 6 Average number of faults to failure.
Fault-Type   Distribution   Baseline router   SHER-3DR router   MTTF Improvement
Hard Fault   Flat           2.37              4.58              1.93
             Weighted       2.055             6.085             2.96
Soft Error   Flat           17.928            26.770            1.49
             Weighted       4.037             21.492            5.32
First, a comprehensive mechanism to handle both soft error
and hard faults in a 3D-NoC router is proposed. The hard
fault support is achieved by leveraging reconfigurable com-
ponents to handle permanent faults in links, input buffers,
and crossbars, while soft error tolerance is obtained via ef-
ficient and light-weight software redundancy that enables
fault recovery in the router pipeline stages. In the second
approach, the system can support a detection, diagnosis and
recovery technique which makes it independent of any com-
plex and costly testing mechanisms commonly found in con-
ventional systems.
Through extensive evaluation, we showed that the pro-
posed 3D-FETO was able to recover efficiently from a sig-
nificant number of soft and hard errors at different fault-
rates, reaching up to 33%. This means that 3D-FETO can
provide up to a 98% packet arrival rate even when almost
one-third of its components have failed. Despite the per-
formance degradation and hardware complexity penalty, we
still consider that this overhead is acceptable. This is be-
cause we made sure that the system is still functional at high
fault rates where previously proposed systems fail to deliver
packets. As reliability constitutes one of the main challenges
in future SoC design, we demonstrated that the proposed
3D-FETO can be used as a reliable and independent system
capable of ensuring fault resiliency in worst case scenarios
and that it can be adopted for mission-critical applications
where correct data delivery is essential.
As future work, we plan to investigate faults
within the Through-Silicon-Vias of 3D-ICs/3D-NoCs to provide
a sufficient fault-tolerance method for 3D-NoC systems. Moreover,
the factors that degrade reliability, such as thermal
stress, operating voltages, and design characteristics, should
also be studied.
Acknowledgements This work is partially supported by Competitive
Research Funding (CRF), The University of Aizu, Reference P-11 (2016),
and JSPS KAKENHI Grant Number JP30453020. This work is also
supported by VLSI Design and Education Center (VDEC), the Univer-
sity of Tokyo, Japan, in Collaboration with Synopsys, Inc. and Cadence
Design Systems, Inc. The first and the last authors in the author-list are
the main contributors of this work.
References
1. Ahmed, A.B., Abdallah, A.B.: Adaptive fault-tolerant architec-
ture and routing algorithm for reliable many-core 3D-NoC sys-
tems. Journal of Parallel and Distributed Computing 93-94, 30–43
(2016)
2. Ben Abdallah, A.: Multicore Systems-on-Chip: Practical Hard-
ware/Software Design, 2nd Edition. Atlantis (2013)
3. Ben Abdallah, A., Masahiro, S.: Basic Network-on-Chip Intercon-
nection for Future Gigascale MCSoCs Applications: Communica-
tion and Computation Orthogonalization. In: Proc. of the Sym-
posium on Science, Society, and Technology (JASSST2006), pp.
1–7 (2006)
4. Ben Ahmed, A., Ben Abdallah, A.: LA-XYZ: low latency, high
throughput look-ahead routing algorithm for 3D network-on-chip
(3D-NoC) architecture. In: IEEE 6th International Symposium on
Embedded Multicore Socs (MCSoC), pp. 167–174. IEEE (2012)
5. Ben Ahmed, A., Ben Abdallah, A.: Low-overhead Routing Algo-
rithm for 3D Network-on-Chip. In: Third International Confer-
ence on Networking and Computing (ICNC), pp. 23–32 (2012)
6. Ben Ahmed, A., Ben Abdallah, A.: Architecture and design of
high-throughput, low-latency, and fault-tolerant routing algorithm
for 3D-network-on-chip (3D-NoC). The Journal of Supercomput-
ing 66(3), 1507–1532 (2013)
7. Ben Ahmed, A., Ben Abdallah, A.: Graceful deadlock-free fault-
tolerant routing algorithm for 3D Network-on-Chip architectures.
Journal of Parallel and Distributed Computing 74(4), 2229–2240
(2014)
8. Bertozzi, D., Benini, L., De Micheli, G.: Error control schemes
for on-chip communication links: the energy-reliability tradeoff.
IEEE Transactions on Computer-Aided Design of Integrated Cir-
cuits and Systems 24(6), 818–831 (2005)
9. Bertozzi, D., Jalabert, A., Murali, S., Tamhankar, R., Stergiou, S.,
Benini, L., De Micheli, G.: NoC synthesis flow for customized do-
main specific multiprocessor systems-on-chip. IEEE Transactions
on Parallel and Distributed Systems 16(2), 113–129 (2005)
10. Chen, P., Dai, K., Wu, D., Rao, J., Zou, X.: The parallel algorithm
implementation of matrix multiplication based on ESCA. In: IEEE
Asia Pacific Conference on Circuits and Systems (APCCAS), pp.
1091–1094. IEEE (2010)
11. Chien, A.A., Kim, J.H.: Planar-adaptive routing: low-cost adaptive
networks for multiprocessors. Journal of the ACM (JACM) 42(1),
91–123 (1995)
12. Constantinides, K., Plaza, S., Blome, J., Zhang, B., Bertacco, V.,
Mahlke, S., Austin, T., Orshansky, M.: Bulletproof: A defect-
tolerant CMP switch architecture. In: The Twelfth International
Symposium on High-Performance Computer Architecture, pp. 5–
16. IEEE (2006)
13. Dally, W.J., Towles, B.P.: Principles and practices of interconnec-
tion networks. Elsevier (2004)
14. Dang, K.N., Meyer, M., Okuyama, Y., Tran, X.T., Ben Abdallah,
A.: A soft-error resilient 3d network-on-chip router. In: IEEE 7th
International Conference on Awareness Science and Technology
(iCAST), pp. 84–90 (2015)
15. DeOrio, A., Fick, D., Bertacco, V., Sylvester, D., Blaauw, D., Hu,
J., Chen, G.: A reliable routing architecture and algorithm for
NoCs. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems 31(5), 726–739 (2012)
16. Dixit, A., Wood, A.: The impact of new technology on soft error
rates. In: 2011 International Reliability Physics Symposium, pp.
5B.4.1–5B.4.7 (2011)
17. Eghbal, A., Yaghini, P.M., Bagherzadeh, N., Khayambashi, M.:
Analytical Fault Tolerance Assessment and Metrics for TSV-based
3D Network-on-Chip. IEEE Transactions on Computers 64(12),
3591–3604 (2015)
18. Ernst, D., Kim, N.S., Das, S., Pant, S., Rao, R., Pham, T., Ziesler,
C., Blaauw, D., Austin, T., Flautner, K., et al.: Razor: A low-power
pipeline based on circuit-level timing speculation. In: Proceedings
36th Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO-36), pp. 7–18. IEEE (2003)
19. Fick, D., DeOrio, A., Chen, G., Bertacco, V., Sylvester, D.,
Blaauw, D.: A highly resilient routing algorithm for fault-tolerant
NoCs. In: 2009 Design, Automation Test in Europe Conference
Exhibition, pp. 21–26 (2009)
20. Hernández, C., Silla, F., Santonja, V., Duato, J.: Dealing with
variability in NoC links. In: 2nd Workshop on Diagnostic Services in
Network-on-Chips, pp. 4–10 (2008)
21. Hsiao, M.Y.: A class of optimal minimum odd-weight-column
sec-ded codes. IBM Journal of Research and Development 14(4),
395–401 (1970)
22. ITRS: 2012 Edition Update Process Integration, Devices, and
Structures. Tech. rep., The International Technology Roadmap for
Semiconductor (2012). http://www.itrs2.net/2012-itrs.html (accessed 16.06.16)
23. Karl, E., Blaauw, D., Sylvester, D., Mudge, T.: Reliability Model-
ing and Management in Dynamic Microprocessor-based Systems.
In: Proceedings of the 43rd Annual Design Automation Confer-
ence, DAC ’06, pp. 1057–1060. ACM, New York (2006)
24. Lehtonen, T., Liljeberg, P., Plosila, J.: Online reconfigurable self-
timed links for fault tolerant NoC. VLSI design 2007, 1–13 (2007)
25. Lehtonen, T., Wolpert, D., Liljeberg, P., Plosila, J., Ampadu, P.:
Self-adaptive system for addressing permanent errors in on-chip
interconnects. IEEE Transactions on Very Large Scale Integration
(VLSI) Systems 18(4), 527–540 (2010)
26. Lin, S., Costello, D., Miller, M.: Automatic-repeat-request error-
control schemes. IEEE Communications Magazine 22(12), 5–17
(1984)
27. NanGate Inc.: Nangate Open Cell Library 45 nm.
http://www.nangate.com/. (accessed 16.06.16)
28. NCSU Electronic Design Automation: FreePDK3D45 3D-
IC process design kit. http://www.eda.ncsu.edu/wiki/FreePDK3D45:Contents. (accessed 16.06.16)
29. Parikh, R., Bertacco, V.: Formally Enhanced Runtime Verification
to Ensure NoC Functional Correctness. In: Proceedings of the
44th Annual IEEE/ACM International Symposium on Microarchi-
tecture, MICRO-44, pp. 410–419. ACM, New York (2011)
30. Pasricha, S., Zou, Y.: A low overhead fault tolerant routing scheme
for 3D Networks-on-Chip. In: 12th International Symposium on
Quality Electronic Design (ISQED), pp. 1–8. IEEE (2011)
31. Prodromou, A., Panteli, A., Nicopoulos, C., Sazeides, Y.: No-
CAlert: An On-Line and Real-Time Fault Detection Mechanism
for Network-on-Chip Architectures. In: Proceedings of the 2012
45th Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO), pp. 60–71 (2012)
32. Radetzki, M., Feng, C., Zhao, X., Jantsch, A.: Methods for fault
tolerance in networks-on-chip. ACM Computing Surveys (CSUR)
46(1), 8 (2013)
33. Rahmani, A.M., Vaddina, K.R., Latif, K., Liljeberg, P., Plosila,
J., Tenhunen, H.: High-performance and fault-tolerant 3D noc-bus
hybrid architecture using arb-net-based adaptive monitoring plat-
form. IEEE Transactions on Computers 63(3), 734–747 (2014)
34. Ravindan, D.K.: Structural fault-tolerance on the noc circuit level.
Tech. rep., Institut fur Technische Informatik, Universitat Stuttgart
(2009)
35. Shamshiri, S., Cheng, K.T.: Yield and cost analysis of a reliable
noc. In: 27th IEEE VLSI Test Symposium, pp. 173–178. IEEE
(2009)
36. Shamshiri, S., Ghofrani, A.A., Cheng, K.T.: End-to-end error cor-
rection and online diagnosis for on-chip networks. In: IEEE Inter-
national Test Conference (ITC), pp. 1–10. IEEE (2011)
37. Sivaram, R.: Queuing delays for uniform and nonuniform traffic
patterns in a MIN. ACM SIGSIM Simulation Digest 22(1), 17–27
(1992)
38. Yu, Q., Ampadu, P.: Transient and permanent error co-
management method for reliable networks-on-chip. In: Fourth
ACM/IEEE International Symposium on Networks-on-Chip
(NOCS), pp. 145–154. IEEE (2010)
39. Yu, Q., Zhang, M., Ampadu, P.: Addressing network-on-chip
router transient errors with inherent information redundancy.
ACM Transactions on Embedded Computing Systems (TECS)
12(4), 105:1–105:21 (2013)
40. Zekri, A.S., Sedukhin, S.G.: The general matrix multiply-add op-
eration on 2D torus. In: 20th International Parallel and Distributed
Processing Symposium (IPDPS), pp. 8–16. IEEE (2006)
