Soft-Error and Hard-fault Tolerant Architecture and Routing Algorithm
  for Reliable 3D-NoC Systems by Dang, Khanh N. et al.
Khanh N. Dang, Yuichi Okuyama, and Abderazek Ben Abdallah
The University of Aizu
Graduate School of Computer Science and Engineering
Aizu-Wakamatsu 965-8580, Japan
Email: {d8162103, okuyama, benab}@u-aizu.ac.jp
Abstract—Network-on-Chip (NoC) paradigm has been pro-
posed as an auspicious solution to handle the strict communi-
cation requirements between the increasingly large number of
cores on single multi- and many-core chips. However, NoC
systems are exposed to a variety of manufacturing, design, and
energetic-particle factors that make them vulnerable to permanent
(hard) faults and transient (soft) errors. In this paper, we present
a comprehensive soft-error and hard-fault tolerant 3D-NoC
architecture, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC
(3D-FETO). With the aid of adaptive algorithms, 3D-FETO
is capable of detecting and recovering from soft errors occurring
in the routing pipeline stages, and it leverages reconfigurable
components to handle permanent faults in links, input
buffers, and the crossbar. In-depth evaluation results show that the
3D-FETO system is able to work around different kinds of
hard faults and soft errors while ensuring graceful performance
degradation, minimizing the additional hardware complexity and
remaining power-efficient.
Index Terms—Fault Tolerance, Routing Algorithm, 3D
Network-on-Chip, Architecture
I. Introduction
In the past few years, the benefits of 3D Integrated Circuits
(3D-ICs) and the regularity of mesh-based Network-on-Chips
(NoCs) have been fused into a promising architecture, called
3D-Network-on-Chip (3D-NoC) [1], opening a new horizon
for IC design. In fact, the parallelism of Network-on-Chip can
be enhanced in the third dimension thanks to the short wire
length and low power interconnects of 3D-ICs. As a result, the
3D-NoC paradigm is considered as one of the most advanced
and auspicious architectures for the future of IC design, as
it is capable of providing extremely high bandwidth and low
power interconnects.
While the NoC paradigm has been increasing in popularity
with several commercial chips, it is threatened by the decreasing
reliability of aggressively scaled transistors. Transistors
are approaching the fundamental limits of scaling, with gate
widths nearing the molecular scale, resulting in breakdown
and wear-out in end products. Therefore, future complex
3D-NoC systems will require significant tolerance to many
simultaneous soft errors and hard faults.
Hard faults, including both permanent faults and intermittent
faults, can occur during the manufacturing stage or under
specific operation circumstances. For both permanent and
intermittent faults, the most natural solution is using redundant
components.
Soft errors arise from energetic particles, such as alpha
particles and neutrons from cosmic rays, generating electron-hole
pairs as they pass through a device. Soft errors do not
permanently damage the gate and last only for a short period of
time. Because of these characteristics, they are unpredictable
and unavoidable. Unlike permanent and intermittent
faults, transient faults cannot be fixed by just replacing the
affected component. Instead, they can be recovered from by
repeating the erroneous operation or by information redundancy
(e.g., Error Correction Code (ECC)). Without an efficient
protection mechanism, these errors can compromise the
system’s functionality and reliability.
Most existing works handle hard faults and soft
errors separately. Hard-fault handling schemes are mainly
based on two approaches: (a) fault-tolerant routing
algorithms, which enable packets to avoid faulty nodes in the
network [1], [2], and (b) architecture-based methods, which use
hardware (component) redundancy and/or reconfiguration to
recover from faults [2], [3], [4]. Soft-error recovery is also
addressed by two main schemes: (a) Error Correction Code (ECC)
based methods for data corruption [5], [6], [7] and (b) control
logic (temporal redundancy) based methods [8], [9], [10].
Although these works provide solutions to separately handle
hard faults and soft errors in the router, no comprehensive
solution has been proposed to simultaneously handle both hard
faults and soft errors in a 3D-NoC system. In addition,
error detection and diagnosis in NoC architectures have been
studied thoroughly in the scope of offline testing. However,
with soft errors and intermittent faults becoming a dominant
failure mode in modern NoC and general VLSI systems, a
widespread deployment of online test approaches has become
crucial.
In this paper, we present a comprehensive soft error and
hard fault tolerant 3D-NoC architecture, named 3D-Hard-
Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). The main
contributions of this work are summarized as follows:
• A new fault-tolerant routing algorithm and architecture
that allow the system to handle soft errors and hard
faults at the same time.
• An efficient scheme for online control fault detection and
diagnosis in 3D-NoC systems.
• Quantitative evaluation and analysis of the proposed
system with different synthetic and realistic benchmarks.
arXiv:2003.09616v1 [cs.AR] 21 Mar 2020
[Figure 1: router block diagram. Labeled blocks: input ports (local, north, east, south, west, up, down); ECC; ARQ buffer; input buffer (RAB); input port manager; LAFT Routing; Switch Allocator; Crossbar; BLoD; fault manager; SER manager; Monitor.]
Fig. 1. Adaptive 3D router (SHER-3DR) architecture.
II. Adaptive 3D Router Architecture (SHER-3DR)
Figure 1 shows the block diagram of the proposed adaptive
3D router architecture (SHER-3DR). The router relies on
simple recovery techniques based on system reconfiguration
with redundant structural resources to contain hard faults in
input-buffers, crossbar, and links, in addition to soft errors in
the routing pipeline stages.
The SHER-3DR router is the backbone component of the
3D-FETO system. Each router has up to
seven input and seven output ports: six input/output ports are
dedicated to the connections to the neighboring routers, and
one input/output port connects the switch to the local
computation tile. As shown in Fig. 1, the SHER-3DR contains
seven Input-port modules, one for each direction, in addition to
the Switch-Allocator and the Crossbar module, which handles
the transfer of flits to the next neighboring node. There are
three pipeline stages in the router: Buffer-Writing (storing the
incoming flits), Next-Port-Computing/Switch-Allocator (routing
and arbitrating), and Crossbar-Traversal (Crossbar).
In this section, we first review the hard-fault handling
mechanism [4], including the Random-Access-Buffer (RAB) for
deadlock-recovery fault-tolerance and the Bypass-Link-on-Demand
(BLoD) approach to handle multiple faulty channels
in the crossbar. Secondly, we present a soft error recovery
mechanism. Finally, a light-weight detection, diagnosis
and recovery mechanism that supports both is described in the last
subsection.
A. Hard Fault Recovery Mechanism Overview
The hard fault recovery mechanisms [4] consist of the Random
Access Buffer and the Bypass-Link-on-Demand. The Random
Access Buffer mechanism (RAB) [4] solves the deadlock problem
that can occur with the look-ahead fault-tolerant routing
algorithm (LAFT), and is able to recover from intermittent and
permanent faults in the input buffer. When a fault is detected
in one of the slots, the slot is flagged, and the main controller
takes the flagged slots into consideration when assigning the
write and read addresses, so that a faulty slot is avoided by both
reading and writing. In this way, RAB
tolerates faults in the input buffer.
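The slot-skipping behavior described above can be sketched in a few lines of Python. This is only an illustrative software model, not the RTL of [4]: the class name RandomAccessBuffer, its methods, and the first-free-slot write policy are assumptions made for the example, and the buffer depth of 4 follows Table I.

```python
class RandomAccessBuffer:
    """Toy model of an input buffer whose faulty slots are skipped (hypothetical API)."""

    def __init__(self, depth=4):
        self.slots = [None] * depth      # stored flits (None = empty)
        self.faulty = [False] * depth    # per-slot fault flags
        self.order = []                  # occupied slot indices in arrival order

    def mark_faulty(self, index):
        """Flag a slot so that later reads and writes avoid it."""
        self.faulty[index] = True
        if index in self.order:          # drop any flit held by the bad slot
            self.order.remove(index)
        self.slots[index] = None

    def write(self, flit):
        """Store a flit in the first healthy free slot; return False if none is left."""
        for i, slot in enumerate(self.slots):
            if not self.faulty[i] and slot is None:
                self.slots[i] = flit
                self.order.append(i)
                return True
        return False

    def read(self):
        """Return the oldest stored flit, or None if the buffer is empty."""
        if not self.order:
            return None
        i = self.order.pop(0)
        flit, self.slots[i] = self.slots[i], None
        return flit


rab = RandomAccessBuffer(depth=4)
rab.mark_faulty(1)                                   # slot 1 has a permanent fault
print([rab.write(f) for f in ("f0", "f1", "f2", "f3")])
print(rab.read(), rab.read())
```

With one slot flagged, the buffer keeps working with three usable slots instead of four, which is the graceful degradation the mechanism aims for.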
The Bypass-Link-on-Demand mechanism (BLoD) [4] provides
additional escape channels whenever the number of
faults in the baseline 7x7 crossbar increases. When
a fault is detected in one or several crossbar links, the BLoD
controller disables the faulty crossbar links and enables the
appropriate number of bypass channels. The number of bypass
links should be kept as small
as possible to reduce the area and power overhead. When
the number of faulty links is larger than the number
of backup links, the system needs to mark the router-to-router
connection as faulty and use the LAFT algorithm to avoid
it [4].
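The allocation rule in this paragraph can be illustrated with a small Python sketch. The function name configure_blod, the link labels, and the alphabetical allocation order are assumptions made for the example; the source does not specify how the BLoD controller picks which faulty link gets which bypass channel.

```python
def configure_blod(faulty_links, num_bypass):
    """Assign bypass channels to faulty crossbar links (illustrative policy).

    Returns (bypass_map, unrecovered). bypass_map maps a faulty crossbar link
    to the index of the bypass channel that replaces it; unrecovered lists the
    links left without a bypass, which must be reported as faulty to routing.
    """
    ordered = sorted(faulty_links)                 # assumed: fixed, deterministic order
    bypass_map = {link: ch for ch, link in enumerate(ordered[:num_bypass])}
    unrecovered = ordered[num_bypass:]
    return bypass_map, unrecovered


# Three faulty crossbar links but only two bypass channels available.
mapping, to_laft = configure_blod({"east", "up", "local"}, num_bypass=2)
print(mapping)
print(to_laft)
```

Any link that ends up in the second list corresponds to the case where the router-to-router connection is marked faulty and left to LAFT.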
B. Soft Error Recovery Mechanism
The soft-error handling method in 3D-FETO
relies on duplicating the pipeline-stage computation (temporal
redundancy) in one more clock cycle; detection is
based on the difference between the computed results. If the two
clock cycles produce identical results, there is no soft error. If the two
consecutive results differ, there is a soft error. In this
case, the system requires a third clock cycle in order to correct
the failure, which is done by majority voting over the three
results. To recover from soft errors in the data, the conventional
ECC (Error Correction Code) [6], [11] is adopted.
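The detect-then-vote principle can be sketched as follows in Python. This is a behavioral illustration only: protected_stage and make_stage are hypothetical names, the routing function is a stand-in, and one call is assumed to correspond to one clock cycle.

```python
from collections import Counter


def protected_stage(compute, inputs):
    """Run a stage twice; on a mismatch, run it a third time and vote.

    Returns (result, cycles_used, soft_error_detected).
    """
    first = compute(inputs)
    second = compute(inputs)             # redundant execution, one extra cycle
    if first == second:
        return first, 2, False
    third = compute(inputs)              # rollback and re-execution
    winner, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one execution was corrupted")
    return winner, 3, True


def make_stage(corrupt_on_calls):
    """Hypothetical next-port computation that is corrupted on the given call numbers."""
    calls = {"n": 0}

    def npc(dest):
        calls["n"] += 1
        port = dest % 7                  # stand-in for the real routing logic
        return (port + 1) % 7 if calls["n"] in corrupt_on_calls else port

    return npc


print(protected_stage(make_stage(set()), 12))   # (5, 2, False)
print(protected_stage(make_stage({2}), 12))     # (5, 3, True)
```

The vote only helps when at most one of the three executions is corrupted, which matches the assumption that a soft error lasts a single clock cycle.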
For ease of understanding, we provide the router pipeline
stages time-chart in Fig. 2. The Next Port Computation (NPC)
and Switch Allocation (SA) run in parallel, as shown in Cycle
2 of Fig. 2. This is achieved by the LAFT routing algorithm
(described later), which eliminates the dependency between the two
stages. The duplication is made for the NPC and
SA stages. After the first computation (in Cycle 2), both of
these stages have an additional computation clock. If a soft
error is detected, the whole pipeline is halted for correction.
In Cycle 1, flits are stored in the input buffer at the Buffer
Writing (BW) stage, and the ECC module checks and corrects
the input data. In Cycle 2, the NPC and
the SA are executed in parallel in the LAFT routing unit. In
Cycle 3, the Redundant NPC (RNPC) and the Redundant SA
(RSA) are computed in parallel. Then, if the output of RNPC
is equal to that of NPC, and SA is equal to RSA, the Crossbar
Traversal (CT) stage is performed in Cycle 3, and the flit goes
to the next router via the output channel in Cycle 4. If the
RNPC is not equal to the NPC, the system rolls back and
recomputes the NPC. Likewise, if SA is not equal to RSA,
the system rolls back and recomputes the SA stage in
Cycle 4. The third results are obtained by re-executing
the failed stages. The router can then determine the correct result
by majority voting over the three results.
Fig. 2. SHER-3DR Pipeline Stages Time-chart.
C. Light-weight Detection, Diagnosis and Recovery Mechanism (DDRM)
Algorithm 1 shows the proposed Detection, Diagnosis
and Recovery Mechanism (DDRM) approach. It uses the feedback
from ECC and the Automatic Retransmission Request
(ARQ) protocol to monitor the errors. This mechanism is
operated by the Fault-manager module in Fig. 1. As shown
in Fig. 1, the input data is first verified by an ECC decoder.
If the value is correct or the ECC decoder can handle the
correction, the flit is written into the input buffer. Otherwise,
a retransmission is requested. Since a transient fault only
lasts a short period of time, assumed to be a single
clock cycle, it does not occur in two consecutive cycles.
Therefore, ARQ can recover from this kind of fault. However, if
a permanent fault occurs, ARQ is unable to correct it and
the faulty connection would keep requesting retransmission
indefinitely. Therefore, if the ARQ cannot correct the fault,
the system considers it a permanent fault (lines 1-10 in
Algorithm 1).
Since the flit’s correctness is verified by the ECC module
before being written into the buffer, a permanent fault can only
occur in the path between the input-buffer in the upstream node
and the one in the downstream node. This path includes input
buffers, crossbar and router-to-router channel.
For the diagnosis and recovery phase, the router’s Fault-manager
module initiates the diagnosis with input buffer
checking. In this step, the error status of the following flits of
the monitored input buffer is checked. If errors are repeated
at the same buffer position, the Fault-manager concludes that this
position has failed and sends the information to the Random Access
Buffer (RAB). If errors are detected at another buffer position,
the fault belongs to the crossbar or the router-to-router link. In this
case, the algorithm moves on to crossbar checking.
In the crossbar checking step, an alternative link provided by
Bypass-Link-on-Demand (BLoD) is selected for another flit. If this
flit is healthy after its transmission, the bypass link has fixed the
fault, and the BLoD configuration is kept as the recovery. If this
flit still fails, the bypass link was unable to fix it, and the fault
belongs to the router-to-router channel. The fault information
is then sent to the Look-Ahead Fault-Tolerant routing modules, which avoid
the connection as the recovery.
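The detection and diagnosis flow described in this subsection can be summarized by the following Python sketch. It is a behavioral illustration under stated assumptions: the callbacks transmit_ok and bypass_transmit_ok, the flag same_slot_fails, and the string return values are hypothetical stand-ins for the ECC feedback and the control signals, and the limit of two consecutive ARQs is the threshold used in the text.

```python
ARQ_LIMIT = 2  # consecutive retransmission requests treated as a permanent fault


def send_with_arq(transmit_ok, limit=ARQ_LIMIT):
    """Detection phase: retry until the ECC check passes or the ARQ limit is hit.

    transmit_ok() is a hypothetical callback returning True when the ECC
    decoder accepts (or corrects) the received flit.
    Returns True if the flit got through, False if a permanent fault is suspected.
    """
    arq_counter = 0
    while arq_counter < limit:
        if transmit_ok():
            return True
        arq_counter += 1                 # ECC asked for a retransmission
    return False


def diagnose(same_slot_fails, bypass_transmit_ok):
    """Diagnosis and recovery phase: name the module that handles the fault."""
    if same_slot_fails:                  # errors repeat at one buffer position
        return "RAB"                     # flag that slot in the Random Access Buffer
    if send_with_arq(bypass_transmit_ok):
        return "BLoD"                    # bypass link fixed it: keep the configuration
    return "LAFT"                        # link itself is faulty: route around it


transient = iter([False, True])          # one corrupted transfer, then a clean one
print(send_with_arq(lambda: next(transient)))   # True
print(diagnose(True, lambda: True))             # RAB
print(diagnose(False, lambda: True))            # BLoD
print(diagnose(False, lambda: False))           # LAFT
```

The ordering of the checks mirrors the text: the buffer is examined first, then the crossbar through a bypass link, and the router-to-router channel is blamed only when both are ruled out.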
III. Look-Ahead-Fault-Tolerant Routing Algorithm
To keep the benefits of look-ahead routing, the Look-Ahead-Fault-Tolerant
routing algorithm (LAFT) [1] should be able
to perform the routing decision for the next node, taking into
consideration its link status, and select the best minimal path.
The fault link information received from the Fault manager
(which handles the Detection Diagnosis Recovery Mechanism)
is read by each input port where LAFT is executed. Algorithm 2
illustrates the LAFT algorithm. The first phase of
this algorithm calculates the next node address depending
on the Next-port identifier read from the flit. For a given
node wishing to send a flit to a given destination, there
exist at most three possible directions, through the X, Y, and Z
dimensions, respectively. In the second phase, LAFT
calculates these three directions. By the end of this
second phase, LAFT has information about the next node’s
fault status and also the three possible directions for a minimal
routing. In the next phase, the routing selection is performed.
For this decision, we adopted a set of prioritized conditions
to ensure fault-tolerance and high performance either in the
presence or absence of faults:
1) The selected direction should ensure a minimal path and
it is given the highest priority in the routing selection.
2) We should select the direction with the largest next-hop
path diversity.
3) The congestion status is given the lowest priority.
By the end of the selection, a routing path with the largest diversity
and lowest congestion is selected. If there is no minimal
routing path due to the presence of faults, another selection is
applied over the non-minimal routing directions.
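The prioritized selection above can be illustrated with the following Python sketch. It is not the implementation of [1]: the helper names, the representation of faults as a set of (node, direction) pairs, the definition of path diversity as the number of healthy minimal directions one hop further, the least-congested non-minimal fallback, and the 4 × 4 × 4 default mesh size are all assumptions made for the example.

```python
# Unit moves for the six neighbor ports of a 3D mesh router, as (x, y, z).
DIRS = {"east": (1, 0, 0), "west": (-1, 0, 0), "north": (0, 1, 0),
        "south": (0, -1, 0), "up": (0, 0, 1), "down": (0, 0, -1)}


def step(node, direction):
    """Coordinates of the neighbor reached through `direction`."""
    return tuple(c + d for c, d in zip(node, DIRS[direction]))


def minimal_dirs(node, dest, faulty):
    """Healthy directions from `node` that reduce the distance to `dest`."""
    found = []
    for name, delta in DIRS.items():
        closer = any((t - c) * d > 0 for c, t, d in zip(node, dest, delta))
        if closer and (node, name) not in faulty:
            found.append(name)
    return found


def laft_next_port(cur, dest, next_port, faulty, congestion, size=(4, 4, 4)):
    """Pick the port the flit should take at the next node (look-ahead)."""
    nxt = step(cur, next_port)
    if nxt == dest:
        return "local"
    candidates = minimal_dirs(nxt, dest, faulty)     # priority 1: minimal paths
    if candidates:
        # Priority 2: largest diversity one hop further; priority 3: congestion.
        return max(candidates,
                   key=lambda d: (len(minimal_dirs(step(nxt, d), dest, faulty)),
                                  -congestion.get((nxt, d), 0)))
    # No healthy minimal direction: fall back to a healthy non-minimal link.
    detours = [d for d in DIRS
               if (nxt, d) not in faulty
               and all(0 <= c < s for c, s in zip(step(nxt, d), size))]
    if not detours:
        return None
    return min(detours, key=lambda d: congestion.get((nxt, d), 0))


cur, dest = (0, 0, 0), (2, 1, 0)
print(laft_next_port(cur, dest, "east", set(), {}))                       # east
print(laft_next_port(cur, dest, "east", {((2, 0, 0), "north")}, {}))      # north
```

In the second call the fault lies two hops ahead, yet the decision already steers away from it, which is the benefit of combining look-ahead with fault information. The fallback shown here does not guard against livelock; a real router needs additional rules for that.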
IV. Evaluation Results
The proposed 3D-FETO system was designed in Verilog-
HDL, synthesized and prototyped with commercial CAD tools
and VLSI technology, respectively [12], [13]. We evaluate
the hardware complexity of 3D-FETO router in terms of area
Algorithm 1: Fault Detection, Diagnosis and Recovery.
   // Automatic Retransmission Request
   Input: transmitting_flit
   // Transmitted Buffer Position
   Input: buffer_position
   // Control signal to all Fault-Tolerance modules
   Output: RAB_control, BLoD_control, LAFT_control
   // Transmit the flit, get the ECC’s feedback
 1 Transmit(transmitting_flit);
 2 ECC_result = ECC-Decoder(transmitting_flit);
   // DETECTION PHASE:
 3 if ECC_result == ARQ then
       // Automatic Retransmission Request
 4     increase(ARQ_counter);
 5     ARQ(transmitting_flit);
 6 else
       // The transmitted flit is non faulty
 7     Finish;
 8 end
   // Check the number of consecutive ARQs
 9 if (ARQ_counter == 2) then
       // There is a permanent fault
       // Jump to DIAGNOSIS-RECOVERY PHASE
10 end
   // DIAGNOSIS-RECOVERY PHASE:
   // Start with Input Buffer Checking
11 Buffer_Failure ← Buffer_Checking(buffer_position);
12 if (Buffer_Failure == Yes) then
       // The Random Access Buffer receives the faulty position to handle.
13     RAB_control = buffer_position;
14     Finish;
15 else
       // The buffer slot is non faulty.
       // Move to Crossbar Checking: using a Bypass-Link.
16     BLoD_control = enable;
       // Get the ECC’s feedback and detect with ARQ_counter.
17     if (ARQ_counter == 2) then
           // BLoD cannot fix the fault, the link has failed.
18         BLoD_control = release;
           // The LAFT routing algorithm handles the faulty link.
19         LAFT_control = faulty;
20         Finish;
21     else
           // BLoD already fixed the failure, the recovery step is finished.
22         Finish;
23     end
24 end
utilization, power consumption (static and dynamic) and speed.
To evaluate the performance of the proposed system, we select
both synthetic and realistic traffic patterns as benchmarks. For
synthetic benchmarks, we select Transpose, Uniform, Matrix-multiplication,
and Hotspot 10%. For realistic benchmarks,
we choose the traffic patterns of an H.264 video encoding system,
Video Object Plane Decoder (VOPD), Picture In Picture (PIP)
and Multiple Window Display (MWD) [14]. The simulation
configurations are listed in Table I.
We evaluate the performance of three fault-tolerant models:
the hard-fault tolerant 3D-FTO [4], the soft-error
resilient model (SER-OASIS), and the proposed system (3D-FETO).
We measure the network transmission time, or end-to-end
latency, with the selected synthetic and realistic benchmarks.
To understand the impact of the fault-tolerant techniques
on performance, we compare the obtained results with the
baseline 3D-NoC system presented in [1]. We randomly inject
faults with three fault rates: 10%, 20% and 33%.
A. Latency Evaluation
In the first experiment, we evaluate the performance of
the proposed architecture in terms of latency over various
Algorithm 2: Look-Ahead-Fault-Tolerant Routing.
// Destination address
Input: Xdest, Ydest, Zdest
// Current node address
Input: Xcur, Ycur, Zcur
// Next-port identifier
Input: Next-port
// Link status information
Input: Fault-in
// New-next-port for next node
Output: New-next-port
// Calculate the next-node address
Next ← Next-node (Xcur, Ycur, Zcur, Next-port);
// Read fault information for the next-node
Next-fault ← Next-status (Fault-in, Next-port);
// Calculate the three possible directions for the next-node
Next-dir ← poss-dir (Xdest, Ydest, Zdest, Nextx, Nexty, Nextz);
// Evaluate the diversity number of three minimal paths
Div1 ← path-div (Xdest, Ydest, Zdest, poss-dir1);
Div2 ← path-div (Xdest, Ydest, Zdest, poss-dir2);
Div3 ← path-div (Xdest, Ydest, Zdest, poss-dir3);
// Evaluate the New-next-port direction
if (|Next-dir| > 1) then
    if (Div1 == Div2 == Div3) then
        New-next-port ← min-congestion (poss-dir1, poss-dir2, poss-dir3);
    else
        New-next-port ← max-diversity (poss-dir1, poss-dir2, poss-dir3);
    end
else
    if (|Next-dir| == 1) then
        New-next-port ← Next-dir1;
    else
        New-next-port ← nonminimal (Xdest, Ydest, Zdest, Xcur, Ycur, Zcur, Fault-in);
    end
end
benchmark programs and error injection rates for three system
configurations: (1) the hard-fault tolerant system (3D-FTO), (2)
the soft-error resilient system (Soft Error Resilience OASIS),
and (3) the hard-fault and soft-error tolerant system (3D-FETO).
The simulation results are shown in Fig. 3(a-h). From these
graphs, we notice that with 0% hard faults (in the input buffer and
crossbar only), 3D-FTO performs almost the same as the
baseline system (LAFT-OASIS). In addition, we found that
even at a 33% fault rate, 3D-FTO increases the latency by only
1.71%, 11.38%, 8.79% and 13.73% for Transpose, Uniform,
6 × 6 Matrix, and Hotspot-10%, respectively. With realistic
benchmarks, the performance of 3D-FTO degrades slightly at
low error rates, but it suffers a larger impact at high
error rates (20% and 33%), since important connections
encounter bottlenecks due to errors inside the input buffers.
However, the proposed 3D-FETO model is still working even
at high fault rates, while the baseline model collapses even at
a 5% error rate.
We used the same benchmark programs to evaluate the soft
error resilience (SER) model. Since both the proposed Soft
Error Resilience mechanism and ECC require additional clock cycles,
we can observe a significant effect on the network transmission
time. With fault rates of 0%, 10%, 20% and 33%, the SER
model increases the average delay in the Transpose benchmark
by 18.57%, 28.74%, 34.54% and 49.62%, respectively.
Finally, we evaluate the proposed 3D-FETO system with
both soft-error and hard-fault handling schemes. As shown in
Fig. 3(a-h), 3D-FETO has a significant impact
on the average latency, which is nearly doubled in both
realistic and synthetic benchmarks. At a 33% fault rate in the
Matrix, Uniform, and Transpose benchmarks, 3D-FETO’s average
packet latency increases by 78.44%, 50.73% and 67.18%, respectively.
However, it still maintains the ability
to work under an extremely high fault rate (33% of hard
faults and 33% of soft errors).
B. Throughput Evaluation
Figure 3(i-l) depicts the throughput evaluation with the adopted
synthetic benchmarks. At a 0% error rate, 3D-FTO presents the
best throughput, which matches the capacity of the baseline
LAFT-OASIS. The Soft Error Resilience OASIS and the
proposed 3D-FETO have lower throughput due to their soft-error
resilience mechanism. When errors are injected into the
system, we can observe a degradation in throughput. Thanks to
the efficient hard-fault tolerance scheme and the fault-tolerant
routing algorithm, 3D-FTO at a 33% error rate provides a slightly
decreased throughput: 40.18%, 43.96%, 43.55% and 32.59%
for Transpose, Matrix, Uniform and Hotspot 10%, respectively.
For the Soft Error Resilience OASIS, the system requires retransmission
in the ARQ mechanism and re-execution in the soft-error
mechanism. Therefore, the throughput is degraded due to the extra
clock cycles. The proposed 3D-FETO, which is a fusion of
both the hard-fault tolerance and soft-error resilience mechanisms,
inherits both degradations. However, these systems provide the
ability to handle errors at rates of up to 33% (the limitation of the soft-error
mechanism).
C. Complexity Evaluation
Table II illustrates the hardware complexity results of 3D-
FETO in terms of area, power (static, dynamic, and total),
and speed. In the hard-fault tolerant router (3D-FTO), the
area and power consumption increase by
1.43% and 25.65%, respectively, according to Table II. The maximum speed has also
slightly decreased. On the other hand, our soft-error handling
mechanism adds seven ARQ buffers and some combinational
logic, which increase the area and power consumption significantly.
However, the proposed 3D-FETO model introduces only
7.50% and 3.73% extra area and power consumption,
respectively, when compared to the soft-error resilience model. In
comparison to the baseline model, 3D-FETO increases the area
and power consumption by 56.39% and 112.10%, respectively,
while the maximum speed decreases by 33.70%.
Although our proposed models are penalized in terms of
area, power consumption, and maximum frequency due to the
additional logic and registers that are necessary for the fault
handling mechanisms, they provide full resilience against
a significant number of soft errors and hard faults.
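For readers who want to check the percentages quoted in this subsection, the short Python sketch below recomputes the relative overheads from the area, total power, and speed columns of Table II. The dictionary layout and the function name overhead are choices made for this illustration; the numeric values are copied from the table.

```python
# (area in um^2, total power in mW, maximum frequency in MHz), copied from Table II
TABLE_II = {
    "Baseline LAFT": (18873, 6.0658, 925.28),
    "3D-FTO": (19143, 7.6219, 909.09),
    "Soft Error Resilience": (27457, 12.4024, 625.00),
    "3D-FETO": (29516, 12.8658, 613.50),
}


def overhead(model, reference):
    """Percentage change of each metric of `model` relative to `reference`."""
    return tuple(round(100.0 * (m - r) / r, 2)
                 for m, r in zip(TABLE_II[model], TABLE_II[reference]))


for model, ref in [("3D-FTO", "Baseline LAFT"),
                   ("3D-FETO", "Soft Error Resilience"),
                   ("3D-FETO", "Baseline LAFT")]:
    area, power, speed = overhead(model, ref)
    print(f"{model} vs {ref}: area {area:+.2f}%, "
          f"power {power:+.2f}%, speed {speed:+.2f}%")
```

A positive value is an increase over the reference model and a negative value a decrease, so the speed column comes out negative for every fault-tolerant variant.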
V. Conclusion and Future Work
In this paper, we proposed a comprehensive fault-tolerant
3D-Network-on-Chip (3D-NoC) system architecture for
highly reliable many-core Systems-on-Chips (SoCs), named
3D-FETO. The proposed system is based on two approaches.
First, a comprehensive mechanism to handle both soft errors
and hard faults in a 3D-NoC router is proposed. Second,
the system supports a detection, diagnosis and recovery
technique which makes it independent of any complex
TABLE I
Simulation Configuration.
Parameter                      System              Value
Network Size (z × y × x)       Matrix              3 × 6 × 6
                               Transpose           4 × 4 × 4
                               Uniform             4 × 4 × 4
                               Hotspot 10%         4 × 4 × 4
                               H.264               3 × 3 × 3
                               VOPD                2 × 2 × 3
                               MWD                 3 × 2 × 2
                               PIP                 2 × 2 × 2
Node’s Delivered Packets       Matrix              10
per transmission session       Transpose           10
                               Uniform             128
                               Hotspot 10%         128
Network’s Delivered Packets    H.264               8,400
per transmission session       VOPD                3,494
                               MWD                 1,120
                               PIP                 512
Packet’s Size                  Hotspot 10%         10 flits + 10% for hotspot
                               Others              10 flits
Flit Size                                          44 bits
Header Size                                        14 bits
Payload Bits                   Baseline, 3D-FTO    30 bits
                               SER, 3D-FETO        18 bits
Parity Bits                    Baseline, 3D-FTO    0 bits
                               SER, 3D-FETO        12 bits
Buffer Depth                                       4
TABLE II
Hardware Complexity Evaluation.
Model                    Area (µm²)   Power (mW)                      Speed (MHz)
                                      Static     Dynamic    Total
Baseline LAFT            18,873       5.1229     0.9429     6.0658    925.28
3D-FTO                   19,143       6.4280     1.1939     7.6219    909.09
Soft Error Resilience    27,457       9.7314     2.6710     12.4024   625.00
3D-FETO                  29,516       10.0819    2.7839     12.8658   613.50
and costly testing mechanisms commonly found in conventional
systems. Through extensive evaluation, we showed that
the proposed 3D-FETO is able to recover efficiently from
a significant number of soft errors and hard faults at different fault
rates reaching up to 33%. Despite the performance degradation
and hardware complexity penalty, we consider this
overhead acceptable.
Acknowledgment
This project is partially supported by Competitive Research
Funding (CRF), University of Aizu, Japan, Ref.P-5-12, and
JSPS Kakenhi Research Grant, Japan, Ref.30453020.
References
[1] A. Ben Ahmed and A. Ben Abdallah, “Architecture and design of
high-throughput, low-latency, and fault-tolerant routing algorithm for
3D-network-on-chip (3D-NoC),” The Journal of Supercomputing, vol. 66,
no. 3, pp. 1507–1532, 2013.
[2] A. DeOrio, D. Fick, V. Bertacco, D. Sylvester, D. Blaauw, J. Hu,
and G. Chen, “A reliable routing architecture and algorithm for NoCs,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 31, no. 5, pp. 726–739, 2012.
[3] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke,
T. Austin, and M. Orshansky, “Bulletproof: A defect-tolerant CMP switch
architecture,” in The Twelfth International Symposium on High-Performance
Computer Architecture, 2006, pp. 5–16, IEEE, 2006.
[4] A. B. Ahmed and A. B. Abdallah, “Adaptive fault-tolerant architecture
and routing algorithm for reliable many-core 3D-NoC systems,” Journal
of Parallel and Distributed Computing, vol. 93-94, pp. 30–43, 2016.
[Figure 3: twelve panels, each plotting Baseline LAFT-OASIS, 3D-FTO, Soft Error Tolerance OASIS, and 3D-FETO against the probability of injected errors (0%, 10%, 20%, 33%). Panels (a)-(h) show average latency (cycles/packet) for (a) Transpose, (b) Uniform, (c) 6 × 6 Matrix, (d) Hotspot 10%, (e) H.264 Encoder, (f) VOPD, (g) MWD, and (h) PIP. Panels (i)-(l) show throughput (flits/node/cycle) for (i) Transpose, (j) Uniform, (k) 6 × 6 Matrix, and (l) Hotspot 10%.]
Fig. 3. Average Packet Latency and Throughput evaluation.
[5] S. Lin, D. Costello, and M. Miller, “Automatic-repeat-request
error-control schemes,” IEEE Communications Magazine, vol. 22, no. 12,
pp. 5–17, 1984.
[6] D. Bertozzi, L. Benini, and G. De Micheli, “Error control schemes for
on-chip communication links: the energy-reliability tradeoff,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
vol. 24, pp. 818–831, June 2005.
[7] Q. Yu and P. Ampadu, “Transient and permanent error co-management
method for reliable networks-on-chip,” in 2010 Fourth ACM/IEEE
International Symposium on Networks-on-Chip (NOCS), pp. 145–154,
IEEE, 2010.
[8] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler,
D. Blaauw, T. Austin, K. Flautner, et al., “Razor: A low-power pipeline
based on circuit-level timing speculation,” in Proceedings of the 36th
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36),
pp. 7–18, IEEE, 2003.
[9] Q. Yu, M. Zhang, and P. Ampadu, “Addressing network-on-chip router
transient errors with inherent information redundancy,” ACM Transactions
on Embedded Computing Systems (TECS), vol. 12, no. 4, p. 105,
2013.
[10] K. N. Dang, M. Meyer, Y. Okuyama, A. Ben Abdallah, and X.-T. Tran,
“Soft-error resilient 3D network-on-chip router,” in 2015 IEEE 7th
International Conference on Awareness Science and Technology (iCAST),
pp. 84–90, IEEE, 2015.
[11] Q. Yu and P. Ampadu, Transient and Permanent Error Control for
Networks-on-Chip. Springer, 2012.
[12] NCSU Electronic Design Automation, “FreePDK3D45
3D-IC process design kit,” Available:
http://www.eda.ncsu.edu/wiki/FreePDK3D45:Contents, 2015.
[13] NanGate Inc., “Nangate Open Cell Library 45 nm,” Available:
http://www.nangate.com/, 2014.
[14] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini,
and G. De Micheli, “NoC synthesis flow for customized domain specific
multiprocessor systems-on-chip,” IEEE Transactions on Parallel and
Distributed Systems, vol. 16, no. 2, pp. 113–129, 2005.
