A design methodology and various performance and fabrication metrics evaluation of 3D Network-on-Chip with multiplexed Through-Silicon Vias by Said, M. et al.
A Design Methodology and Various Performance and
Fabrication Metrics Evaluation of 3D Network-on-Chip
with Multiplexed Through-Silicon Vias
Mostafa Said1, Ahmed Shalaby2, Farhad Mehdipour3, Morteza Biglari-Abhari4, Mohamed El-Sayed2
1Department of Electrical Engineering, Faculty of Engineering, Assiut University, Assiut, Egypt
2Department of Electronics and Communications, Egypt-Japan University of Science and Technology (E-JUST), Alexandria, Egypt
3E-JUST Center, Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
4Department of Electrical & Computer Engineering, University of Auckland, Auckland, New Zealand
Email: mostafa.saied@ejust.edu.eg, Ahamed.Shalaby@ejust.edu.eg, farhad@ejust.kyushu-u.ac.jp,
m.abhari@auckland.ac.nz, m.ragab@ejust.edu.eg
Abstract—The use of short Through-Silicon Vias (TSVs) in
3D integration technology significantly reduces routing area,
power consumption, and delay. However, 3D integration
technology still faces several challenges, mainly low yield,
which is a direct result of the extra fabrication steps required
for TSVs. Reducing the TSV count therefore improves yield
considerably and hence reduces cost. A TSV multiplexing
technique called TSVBOX was introduced in [1] to reduce the
TSV count without affecting the direct benefits of TSVs.
Although the TSVBOX introduces some delay to the multiplexed
signals, the delay effect of TSV multiplexing has not been
addressed yet. In this paper, we analyze the TSVBOX timing
requirements and propose a design methodology for
TSVBOX-based 3D Network-on-Chip (NoC). Performance and power
comparisons are then conducted to investigate the direct effects
of TSV multiplexing on these two metrics. After that, the basic
fabrication metrics are compared to investigate the effect of the
proposed design methodology on yield and cost. We show that the
TSVBOX greatly improves the fabrication metrics at minimal
degradation in performance and power consumption, especially
for Hotspot-like traffic patterns.
I. INTRODUCTION
Conventional 2D integration has many limitations with respect
to the needs of today's large systems. For example, long
wires increase power consumption and routing area. It is also
very difficult to distribute the clock signal with
minimum delay in large systems. Such problems leave 2D
integration unable to keep up with Moore's law [2].
On the other hand, 3D integration is an emerging technology
that can mitigate the main limitations of conventional 2D
integration. However, some challenges still need further
research before this promising technology becomes mature and
reliable. One of these challenges is reliability in terms of
yield and cost. 3D-ICs show very low yield due to the extra
fabrication steps for bonding dies or wafers to each other to
create the 3D stack. These extra steps may result in faulty
TSVs due to misalignment of TSVs, partially filled TSVs,
etc. [3]. What makes the situation worse is that the
probability of having faulty TSVs increases as the total number
of TSVs increases. Hence, finding a technique that reduces the
TSV count without affecting the benefits gained from 3D
integration is very important. In [1, 4, 5] the TSV count has
been reduced by multiplexing, serialization, and virtualization,
respectively. The TSV multiplexing technique introduced in [1]
multiplexes every two 3D signals¹ into one signal and passes
this signal through one TSV instead of two as in conventional
3D-ICs. Therefore, an almost 50% reduction in the number of TSVs
is achieved. Owing to this significant reduction in TSV count,
the yield analysis in [1] revealed a very high improvement over
conventional 3D-ICs.
The investigation in [1] covered area, yield, cost, and
power consumption analysis. In [6, 7], other physical effects
of reducing the TSV count are studied. In [6], the impact of
reducing the TSV count on the maximum temperature is
investigated; as expected, the maximum temperature increases as
the TSV count decreases. TSVs are usually fabricated using
materials with low thermal resistivity, such as copper or
tungsten [2], so reducing the TSV count increases the total
thermal resistance of the 3D stack. In [7], the impact of the
residual thermal stress created during the bonding process is
studied in order to determine accurately the Keep-Out-Zone area
overhead around TSVs and hence to estimate the yield of the 3D
stack accurately.
The TSVBOX uses an extra selection signal (S) to control the
multiplexer (MUX) and the demultiplexer (DeMUX). This S
signal introduces some delay to one of the multiplexed signals,
in addition to the parasitics of the TSVBOX itself. Such delay
may affect the functional validity of a system implemented
using the TSVBOX. Although [1, 6, 7] address most of the issues
related to TSV multiplexing and show the advantages and
limitations of this technique, the timing requirements and the
design methodology based on TSV multiplexing have not
been studied yet. None of the above related works shows
a system implementation of a circuit using the TSVBOX, so its
functionality and applicability at the system level have not been
proven yet.
¹ A 3D signal is a signal that traverses from one layer to another in the
3D stack.
Owing to its scalability and novelty as a multicore
communication architecture for future multiprocessor SoCs, the 3D NoC
is selected as our target system architecture for applying the
TSVBOX technique. In this paper, the timing requirements for
the TSVBOX-based 3D NoC are investigated so that the delay
introduced by the TSVBOX is mitigated and both performance
degradation and incorrect operation are avoided. The
design methodology for TSVBOX-based 3D NoCs is also
introduced. For the sake of comparison, two versions of the
3D NoC are introduced: one based on conventional 3D
integration without TSV multiplexing, and the other
TSVBOX-based. Finally, the main aspects of the NoC
architecture in terms of performance and power are investigated
under different simulation scenarios.
The contributions of this paper can be summarized in the
following points:
• Introducing a low-level circuit model for the TSVBOX that
shows the possible RC parasitics of its different
components, in order to estimate the TSVBOX delay and
power consumption.
• Studying the timing requirements of the TSVBOX, and
determining the selection signal properties and its relation
to the main clock signal in the case of 3D NoCs.
• Introducing a complete design methodology with the detailed
steps required to design a 3D NoC involving the TSVBOX.
• Investigating the most important aspects of the 3D NoC,
namely performance and power consumption, and showing the
cases in which the TSVBOX does not affect these metrics.
• Proposing analytical models to compare the basic
fabrication metrics, yield and cost, of the conventional and
TSVBOX-based 3D NoCs.
Finally, this work substantially extends the work of [8]. All
differences between this work and the work of [8] are listed in
Table I. According to our estimation, the extra work amounts to
more than 60% of the work of [8].
The rest of this paper is organized as follows: Section II
presents the details of the TSVBOX technique. Section III
explores the architecture and design of the target 3D NoC and
the 3D router used. Sections IV and V highlight the TSVBOX
parasitic model and various design aspects. Section VI in-
troduces the TSVBOX timing requirements analysis and the
TSVBOX-based 3D NoC design methodology. The scalability
analysis is investigated in Section VII, while the fabrication
yield and cost are analyzed in Section VIII. Simulation results
are discussed and presented in Section IX. Finally, Section X
concludes the paper.
II. TSVBOX
Fig. 1a shows the TSVBOX structure. As shown in Fig. 1b,
the two inputs of the TSVBOX multiplexer (MUX) are the
two signals ($V_1$, $V_2$) that are to be multiplexed
through a single TSV. The S signal (Fig. 1c) controls the MUX and
the DeMUX, and its clocking period $T_S$ is at least double the
delay of the TSVBOX ($T_{d-TSVBOX}$), where $T_{d-TSVBOX}$ is the
delay from the input point $V_1$ ($V_2$) to the output point
$V_1'$ ($V_2'$). Assuming that $V_2$ is selected during the first
half cycle, the TSVBOX circuit will hold the charge of $V_2$
during the second half cycle, therefore

$$V_2'(t) = V_2\left(t - \frac{T_S}{2}\right) \qquad (1)$$

During the second half cycle, similar behavior is repeated for
$V_1'$, but with an additional time shift equal to $T_S/2$ due to
waiting for the selection process of the first half cycle, so
in the end we have

$$V_1'(t) = V_1(t - T_S) \qquad (2)$$

Fig. 2 shows all the TSVBOX signals and how $V_1$ and $V_2$ are
affected by the TSVBOX delays. For more details about the
TSVBOX functionality, refer to [1]. In [1] it is assumed that
the delay incurred by the TSVBOX satisfies $T_S \ll T_{CLK}$, so this
delay can be neglected and there would be no correctness
issues in reading the multiplexed voltages. In real applications
this situation is feasible, for example, when both 3D paths of
$V_1'$ and $V_2'$ are not part of the critical path and their delay
is less than the critical path delay by at least $T_S$. However,
the general situation in which $T_S$ is comparable to the clock
period is not addressed in [1]. Therefore, there are some timing
requirements that the TSVBOX must fulfill for the sake of
correct operation. In this paper, these conditions and requirements
are studied for the 3D NoC architecture.
III. THE TARGET 3D NOC ARCHITECTURE
Fig. 3 shows the target 3D NoC architecture, which relies on
a 4×4×4 mesh topology. An 80-core NoC was presented in [9],
so our target NoC matches the trend in the NoC domain. For
simplicity, each router has a core concentration of one [10],
which means that only one processor core is connected to the local
port of the router. To achieve strong fairness between the internal
requests of the router, separable allocation is adopted with
Round-Robin arbitration [11]-[12]. The size of each injected
packet is five flits, and each flit in turn is 64 bits. The head flit
contains the routing information, while the others
carry the data. We choose wormhole switching and XYZ
deterministic routing as the 3D NoC switching technique
and routing algorithm [13], respectively. Each input port in the
router contains two virtual channels of size one flit, while the
local virtual channel buffers are assumed to be of infinite
size in order to isolate the traffic injection process from the NoC.
Therefore, the delay after generating a flit and before it enters
the source buffer is accounted for, as well as the delay in the source
queue buffers [14]. According to [15, 16], the total system area
can be assumed to be 400 mm², so the area of each layer of the
target 3D NoC can be assumed to be 10×10 mm². Hence, the length of
the horizontal interconnect wires between two neighbor routers
is 2.5 mm [16]. For the vertical interconnects, we choose the
TSV capacitance to be a parameter in our simulations. The
change in TSV capacitance reflects the change in TSV length
TABLE I: Differences and extra contributions of this extended work in comparison to the work of [8].
| Facets of comparison | Work of [8] | This paper |
|---|---|---|
| 3D NoC adopted | 3×3×2 | 4×4×4 |
| Switching technique | Store-And-Forward | Wormhole |
| Virtual channels (VCs) | Single buffer per port | 2 VCs per port |
| Technology adopted | 180 nm | Finer 65 nm technology |
| TSV capacitance | Fixed at 15 fF | Considered as a parameter from 15 fF to 500 fF to reflect different TSV technologies |
| TSVBOX multiplexing ratio used | Fixed at 2×1 | Considered as a general parameter NMUX×1, NMUX≥2 |
| Yield and cost analysis | - | Included in Section VIII |
| Analytical analysis of the effect of different traffic patterns on the performance of TSVBOX | - | Included in Subsection VII-D |
| Power consumption evaluation | - | Included in Subsection IX-E |
| Scalability analysis | - | Included in Section VII |
Fig. 1: (a) TSVBOX schematic, (b) TSVBOX circuit implementation, (c) Selection signal S.
and the resistivity of the substrate bulk used, which in turn
reflects different 3D integration technologies [17]-[20].
The data bus width $N_{BW}$ is assumed to be equal to
the flit size, as shown in Figs. 5(a,b). For the conventional
3D NoC, the whole 3D data bus width is $N_{BW}+2$, where the
two extra bits are required for the handshaking communication
protocol, which needs request (REQ) and acknowledgement
(ACK) signals [21]. For the TSVBOX case, the data bits
of the packet are multiplexed and hence $N_{BW}/2+2$ TSVs are
required, including the REQ and ACK signals. However, each
vertical bus requires two extra TSVs to transfer the S signal
and its inverted version $\bar{S}$. Therefore, the
vertical connection bus width is $N_{BW}/2+4$ for the TSVBOX-based
3D NoC. Any two neighbor routers are connected by
Fig. 2: Various TSVBOX signals; $S(t)$, $V_1(t)$, $V_2(t)$, $V_1'(t)$, and $V_2'(t)$.
Fig. 3: 4×4×4 3D NoC architecture.
two opposite unidirectional channels. Thus the vertical port
contains $2(N_{BW}+2)$ TSVs for the conventional 3D NoC, and
$2(N_{BW}/2+4)$ for the TSVBOX case.
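The TSV counts above can be checked with a short calculation. The following is a minimal sketch, assuming the 64-bit flit size stated earlier as the data bus width; the function name `tsvs_per_vertical_port` is introduced here for illustration only and is not part of the original work.

```python
def tsvs_per_vertical_port(n_bw: int, multiplexed: bool) -> int:
    """TSVs in one vertical port, i.e. two opposite unidirectional channels."""
    if multiplexed:
        if n_bw % 2:
            raise ValueError("2x1 multiplexing needs an even data bus width")
        # N_BW/2 multiplexed data TSVs + REQ + ACK + S + inverted S
        per_channel = n_bw // 2 + 4
    else:
        # N_BW data TSVs + REQ + ACK
        per_channel = n_bw + 2
    return 2 * per_channel


N_BW = 64  # flit size in bits, taken as the data bus width
conventional = tsvs_per_vertical_port(N_BW, multiplexed=False)
tsvbox = tsvs_per_vertical_port(N_BW, multiplexed=True)
reduction = 1 - tsvbox / conventional
print(conventional, tsvbox, f"{reduction:.1%}")
```

For a 64-bit bus the reduction is somewhat below 50%, because the handshaking and selection TSVs are not multiplexed.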
Since SPICE models for such a system would be too complicated
and time-consuming in both design and simulation,
SystemC-A is used for our 3D NoC implementation. SystemC,
supported by the Open SystemC Initiative (OSCI) [22], is an
open-source language intended to meet the ever-increasing
needs of system-level design and SoC technologies. Using
SystemC-A, high- and low-level implementations can be
combined within one system. For processor cores, routers,
and intra-layer interconnects, we use a behavioral system
implementation, while for inter-layer interconnects (the vertical
connections), we rely on a low-level circuit implementation in
order to estimate delays and power consumption.
IV. MODELING
As shown in Fig. 4a, the 3D signal is assumed to pass
through an input inverter driver, a global wiring segment in
the first layer, a TSV, a global wiring segment in the second
layer, and an output inverter driver. The output inverter driver
is always assumed to be a 1x-inverter (minimum-size inverter). For
the TSV and wiring circuit models, the models introduced in
[2, 23] are used, which are shown in Figs. 4(b,c), respectively.
The TSVBOX is composed of MUX and DeMUX circuits
with a TSV in between. Each of the MUX and the DeMUX is composed
of two transmission gates, and each transmission gate is
composed of two transistors. Therefore, to model the TSVBOX,
a transistor circuit model that captures the transistors'
RC parasitics is required. Referring to [24], the transistor
parasitics can be modeled as shown in Fig. 4(d,e). The
parasitics of this model are: the ON resistance of the NMOS ($R_{onN}$),
the ON resistance of the PMOS ($R_{onP}$), the NMOS source/drain-bulk
capacitance ($C_{sbN}$/$C_{dbN}$), the PMOS source/drain-bulk capacitance
($C_{sbP}$/$C_{dbP}$), the NMOS gate capacitance ($C_{gN}$), and the PMOS gate
capacitance ($C_{gP}$). According to [24], $C_{sbN}=C_{dbN}=C_N$ for
the NMOS and $C_{sbP}=C_{dbP}=C_P$ for the PMOS. Also, for equivalent
TABLE II: Transistor and other miscellaneous parasitics for 65 nm.
| Transistor parasitic | Unit | 65 nm |
|---|---|---|
| CgP | fF | 0.0689 |
| CgN | fF | 0.0689 |
| CsbP | fF | 0.0832 |
| CsbN | fF | 0.0819 |
| CdbP | fF | 0.0832 |
| CdbN | fF | 0.0819 |
| RonP | kΩ | 44.462 |
| RonN | kΩ | 21.077 |
| CW | fF | 25 |
| RW | kΩ | 0.04132 |
| \|VthP\| | V | 0.39 |
| VthN | V | 0.4 |
| VDD | V | 1 |
| CL | fF | 1.3 |
| Cp | fF | 1.5 |
Fig. 4: (a) The 3D signal path and circuit models of (b) TSV, (c) global wiring, (d) NMOS, and (e) PMOS.
NMOS and PMOS sizes, the NMOS and PMOS gate capacitances
are equal, so $C_{gP}=C_{gN}=C_g$.
A. Conventional 3D-IC 3D signal path modeling
The conventional 3D-IC 3D signal path is shown in Fig. 6
where the signal is assumed to pass through an inverter driver
(represented by its ON resistance Rdr−Conv), a global wiring
segment in the first layer, a TSV, a global wiring segment in
the second layer, and a load capacitance which is assumed
to be the input gate capacitance of a 1x-inverter driver in the
second layer.
B. TSVBOX 3D signal path modeling
Fig. 7 shows the TSVBOX circuit model. It is similar to
the circuit of the conventional 3D signal path; the difference
is that the equivalent RC parasitic circuits of the transistors in
the MUX and DeMUX are involved. Since there are no transistor
models in SystemC-A, the transistors of the transmission gates
are modeled using perfect switches. The S signal controls the
upper transmission gates of the MUX and the DeMUX, and its
inverted version $\bar{S}$ controls the lower ones. Therefore, the
lower and upper transmission gates switch ON or OFF
exclusively, as required in the original TSVBOX design. The S
signal path shown in Fig. 7 is similar to the conventional 3D
signal path. However, since the S signal drives the gates
of the transmission gates, each S path has four gate
capacitances ($4C_g$) in its load: $2C_g$ from the
MUX and $2C_g$ from the DeMUX.
V. DESIGN PARAMETERS AND PARASITICS
In this section, the parasitic values and technology parameters
are introduced and the design considerations are detailed.
A. Technology parasitics and parameters
In this study, 65 nm is selected as our target technology.
The technology parasitics and parameters span the NMOS and
PMOS parasitics shown in Fig. 4 as well as their threshold
voltages. They also cover the global wiring parasitics that are
used for IPs and multicore 3D-ICs [25] and the input and
Fig. 5: Full duplex transmission for (a) conventional and (b) TSVBOX-based 3D NoC.
Fig. 6: Conventional 3D NoC 3D signal path.
Fig. 7: TSVBOX-based 3D NoC 3D signal path.
the output capacitances of the 1x-inverter driver shown in Fig. 4a.
Table II shows all the values used in this study [24–26], noting
that the length of the wires is assumed to be 200 µm, similar to [17].
For the TSV technology, the experiments are run once for
$C_{TSV}=15$ fF and once for $C_{TSV}=500$ fF to cover the
whole range of TSV capacitances and technologies, while the
maximum value $R_{TSV}=1\ \Omega$ is selected for all cases [17].
B. Transmission gate transistors
According to [27], transmission gate transistors are usually
selected to have minimum size. Also, as stated in [27], there is
no need to decrease RonP , hence WP=LP and KN=KP=1
(the sizes of NMOS and PMOS transistors, respectively) are
our design choices for the transmission gate transistors.
C. Threshold voltage selection for the drivers
There are two input thresholds: $V_{inL-max}$ and $V_{inH-min}$.
$V_{inL-max}$ is the maximum low input voltage required to
switch the PMOS ON and the NMOS OFF at the same time.
Therefore, if $V_{in} \le V_{inL-max}$, the NMOS will be OFF and the PMOS
will be ON. The other threshold voltage is $V_{inH-min}$,
which is the minimum high input voltage required to switch
the NMOS ON and the PMOS OFF at the same time; therefore, if
$V_{in} \ge V_{inH-min}$, the NMOS will be ON and the PMOS will be OFF.
Based on these definitions, the thresholds can be
selected as follows:

$$V_{inL-max} = V_{thN}, \qquad V_{inH-min} = V_{DD} - |V_{thP}| \qquad (3)$$
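As a numerical illustration of Eq. 3, the following sketch evaluates the two thresholds with the 65 nm values of Table II and classifies an input voltage accordingly. The helper name `logic_level` is hypothetical and only serves the example.

```python
V_DD, V_THN, V_THP = 1.0, 0.4, 0.39  # Table II, 65 nm; V_THP is |VthP|

V_INL_MAX = V_THN          # Eq. 3: maximum voltage read as logic '0'
V_INH_MIN = V_DD - V_THP   # Eq. 3: minimum voltage read as logic '1'


def logic_level(v_in: float):
    """Return 0 or 1 for a valid input level, or None in the undefined band."""
    if v_in <= V_INL_MAX:
        return 0
    if v_in >= V_INH_MIN:
        return 1
    return None


print(V_INL_MAX, V_INH_MIN, logic_level(0.5))
```

Any voltage between the two thresholds is treated as undefined, which is the situation the timing analysis of Section VI is meant to avoid at the sampling instant.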
VI. TIMING REQUIREMENTS’ ANALYSIS
In this section, the timing analysis related to the TSVBOX-based
3D NoC is discussed. The timing requirements analysis includes
the 3D path delays, how to choose the selection signal period $T_S$,
and the relation between S and the main clock signal.
A. 3D signal path Elmore-delay model
The 3D signal path delay for the data and S signals in the
conventional and TSVBOX-based 3D NoCs can be approximated
using the first-order Elmore-delay model [24]. The delay for
data signals is the time required for the 3D data signal to
reach $V_{inH-min}=V_{DD}-|V_{thP}|$. The 3D data signals passing
through the conventional or the TSVBOX 3D paths are
destined to a 1x-inverter driver load, as shown in Figs. 6 and 7,
respectively. For the S signal, the delay is the time required for
S to reach $\max(V_{thN},|V_{thP}|)$, because S is destined to the
gates of the transmission gate transistors (PMOS or NMOS). The
voltage required for the S signal to switch either the NMOS or
the PMOS ON is $V_{thN}$ or $|V_{thP}|$, respectively, and the maximum
of these two voltages is selected to meet both conditions.
According to Section V, the wiring and TSV resistances
are in the order of ohms, so their contribution to the total
delay is negligible. These resistances are therefore ignored to
simplify the delay analysis in this paper.
1) Conventional 3D NoC 3D signal delay: Referring to
Fig. 6, the conventional 3D NoC 3D signal path delay
($T_{d-Conv}$) can be approximated using the Elmore delay as follows:

$$T_{d-Conv} = \ln\frac{V_{DD}}{|V_{thP}|} \cdot \Big( R_{dr-Conv} \cdot (C_p + 2C_W + C_{TSV} + C_L) \Big) \qquad (4)$$
2) TSVBOX-based 3D NoC 3D signal delay: The TSVBOX
has two different paths according to the value of the S signal.
Fig. 8 shows the situation when S=1 ($\bar{S}$=0), in which $V_1$ passes
through the TSVBOX, while Fig. 9 shows the other state, when
S=0 ($\bar{S}$=1), in which $V_2$ passes. As can be seen, both paths are
similar; therefore, the TSVBOX 3D path delay ($T_{d-TSVBOX}$) is the
same for $V_1$ and $V_2$. The Elmore delay for both $V_1$ and $V_2$
can be approximated by the following equation:

$$T_{d-TSVBOX} = \ln\frac{V_{DD}}{|V_{thP}|} \cdot \Big( R_{dr-TSVBOX} \cdot (C_p + C_{PN}) + (R_{dr-TSVBOX} + R_{PN}) \cdot (4C_{PN} + 2C_W + C_{TSV}) + (R_{dr-TSVBOX} + 2R_{PN}) \cdot (C_{PN} + C_L) \Big) \qquad (5)$$

where $C_{PN} = C_{dbP} + C_{dbN}$ and $R_{PN} = \frac{R_{onP} \cdot R_{onN}}{R_{onP} + R_{onN}}$.
3) Selection signal delay: The S signal path was shown
previously as part of the TSVBOX in Fig. 7. As shown there, each
TSVBOX contributes $2(C_{gN}+C_{gP})$ to the load of the S signal,
i.e., the four gate capacitances noted in Section IV-B.
Then, for a data bus width of $N_{BW}/2$, the TSVBOXes contribute
$N_{BW} \cdot (C_{gN}+C_{gP})$ in total to the S signal load. As stated
in Section V, the transmission gate transistors are equally
sized ($C_{gN}=C_{gP}=C_g$); therefore, the total gate load for the S
signal is $2N_{BW} \cdot C_g$. The Elmore delay for the S signal to reach
$\max(V_{thN},|V_{thP}|)$ is

$$T_{d-S} = \ln\frac{V_{DD}}{V_{DD} - \max(V_{thN}, |V_{thP}|)} \cdot \Big( R_{dr-S} \cdot (C_p + 2C_W + C_{TSV} + 2N_{BW} \cdot C_g) \Big) \qquad (6)$$
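The three Elmore-delay expressions (Eqs. 4 to 6) can be evaluated numerically as in the sketch below, which uses the 65 nm values of Table II. The driver resistances are design variables that are not fixed at this point in the text, so the 1 kΩ value used in the example call is purely an assumption for illustration, as are the function names.

```python
import math

# Table II (65 nm): capacitances in fF, resistances in kOhm, voltages in V.
# kOhm * fF = ps, so all delays below are in picoseconds.
C_P, C_L, C_W, C_G = 1.5, 1.3, 25.0, 0.0689
C_DBP, C_DBN = 0.0832, 0.0819
R_ONP, R_ONN = 44.462, 21.077
V_DD, V_THN, V_THP = 1.0, 0.4, 0.39  # V_THP is |VthP|

C_PN = C_DBP + C_DBN
R_PN = R_ONP * R_ONN / (R_ONP + R_ONN)


def t_d_conv(r_dr: float, c_tsv: float) -> float:
    """Eq. 4: conventional 3D data path delay."""
    return math.log(V_DD / V_THP) * r_dr * (C_P + 2 * C_W + c_tsv + C_L)


def t_d_tsvbox(r_dr: float, c_tsv: float) -> float:
    """Eq. 5: TSVBOX 3D data path delay (same for V1 and V2)."""
    rc = (r_dr * (C_P + C_PN)
          + (r_dr + R_PN) * (4 * C_PN + 2 * C_W + c_tsv)
          + (r_dr + 2 * R_PN) * (C_PN + C_L))
    return math.log(V_DD / V_THP) * rc


def t_d_s(r_dr_s: float, c_tsv: float, n_bw: int) -> float:
    """Eq. 6: delay of the selection signal S."""
    v_on = max(V_THN, V_THP)
    load = C_P + 2 * C_W + c_tsv + 2 * n_bw * C_G
    return math.log(V_DD / (V_DD - v_on)) * r_dr_s * load


R_DR = 1.0  # kOhm, hypothetical driver resistance (assumption)
for c_tsv in (15.0, 500.0):
    print(c_tsv, t_d_conv(R_DR, c_tsv), t_d_tsvbox(R_DR, c_tsv),
          t_d_s(R_DR, c_tsv, 64))
```

For equal driver resistance, the TSVBOX path is slower than the conventional path because of the series transmission-gate resistance $R_{PN}$ and the added junction capacitances.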
B. Avoiding concurrent ON state of the TSVBOX switches
As shown in Fig. 10, at $t=T_S/2$, S and $\bar{S}$ start
discharging and charging, respectively. However, $\bar{S}$ reaches
the ON threshold voltage $\max(V_{thN},|V_{thP}|)$ at $t=T_1$, while
S reaches the OFF threshold voltage $\min(V_{thN},|V_{thP}|)$ at
$t=T_2$. This is because $\bar{S}$ starts from '0' and S starts from
$V_{DD}$, and '0' is much closer to $\max(V_{thN},|V_{thP}|)$ than
$V_{DD}$ is to $\min(V_{thN},|V_{thP}|)$.
It can be seen that during $T_1 \le t \le T_2$, all the transmission
gate switches of the TSVBOX are ON, which violates the
theoretical behavior of the TSVBOX.
Fig. 11 shows how the above problem can occur as a
result of the previously discussed behavior². In this example,
$V_1$ and $V_2$ are logic '1' and '0', respectively.
During $[0, T_S/2]$, $V_1'$ reaches ≈1.32 V and $V_2'$ is
'0', and both are acceptable voltage levels ($V_1' > V_{inH-min}$
and $V_2' < V_{inL-max}$). However, $V_1'$ drops during the period of
concurrent operation ($T_{overlap}$) to ≈1.12 V, which is
$< V_{inH-min}$. At $t=T_2$, S reaches the OFF threshold and the
switches of the $V_1$ path open, so $V_1'$ stays at ≈1.12 V until
$t=T_S$ (the instant at which the downstream router reads the data);
therefore, it cannot be considered a logic high as expected.
To avoid this problem, we must make sure that $V_1'$ and $V_2'$
have acceptable voltage levels at $t=T_S$, the time at which the
downstream router reads the data. A simple solution is to let both
S and $\bar{S}$ charge and discharge faster in order to shorten the
overlap period, so that $V_1'$ and $V_2'$ remain at acceptable
levels. For instance, in the example illustrated in Fig. 11,
$V_1'$ and $V_2'$ should settle at voltages $> V_{inH-min}$
and $< V_{inL-max}$, respectively.
In summary, we calculate $T_{overlap}$ in two different
ways. The first is to calculate $T_1$ and $T_2$ and then subtract
$T_1$ from $T_2$, which makes $T_{overlap}$ a function
of the S and $\bar{S}$ driver sizes. The second uses
the observation that $T_{overlap}$ should be smaller than or equal to
the time taken by $V_1'$ to reach $V_{inH-min}$. By equating these
two expressions, we can determine the driver size of
both S and $\bar{S}$ that results in the minimum acceptable $T_{overlap}$,
such that at $t=T_S$ both $V_1'$ and $V_2'$ have correct and
² For the sake of clarity, this example is derived for 180 nm
technology, in which $V_{DD}$=1.8 V, $V_{thN}$=0.53 V, and $|V_{thP}|$=0.51 V,
so that the voltage levels are easier to distinguish than in the 65 nm case.
Fig. 8: $V_1$ 3D path.
Fig. 9: $V_2$ 3D path.
Fig. 10: S and $\bar{S}$ signals before (a) and after (b) increasing driver sizes to avoid negative effects of concurrent ON states of the TSVBOX switches.
acceptable voltage levels. To simplify the following analysis,
two assumptions are made. First, we choose the same driver
size for S and $\bar{S}$, so

$$R_{dr-S} = R_{dr-\bar{S}}$$

Second, we assume that $T_S/2$ is large enough that $V_1'$
charges all the way to $V_{DD}$, not only up to $V_{inH-min}$, during
$[0, T_S/2]$, and hence the discharging curve starts from $V_{DD}$
rather than $V_{inH-min}$.
Further, for better visualization, Fig. 12
shows a simplified version of the TSVBOX
circuit, where $C_{p-PN-1}=C_{p-PN-2}=C_p+C_{PN}$,
$C_{PN-L-1}=C_{PN-L-2}=C_{PN}+C_L$, and the intermediate
capacitance is $C_{int}=4C_{PN}+2C_W+C_{TSV}$.
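Assuming $T_1$ is the time taken by $\bar{S}$ to charge to
$\max(V_{thN},|V_{thP}|)$, and $T_2$ is the time taken by S to discharge
to $\min(V_{thN},|V_{thP}|)$, then

$$T_1 = \tau_{\bar{S}} \cdot \ln\frac{V_{DD}}{V_{DD} - \max(V_{thN}, |V_{thP}|)} \qquad (7)$$

$$T_2 = \tau_S \cdot \ln\frac{V_{DD}}{\min(V_{thN}, |V_{thP}|)} \qquad (8)$$

Since S and $\bar{S}$ have the same driver sizes,

$$\tau_S = \tau_{\bar{S}} = R_{dr-S} \cdot (C_p + 2C_W + C_{TSV} + 2N_{BW} \cdot C_g)$$

The overlap period can then be calculated as

$$T_{overlap} = T_2 - T_1 = \tau_S \cdot \ln\frac{V_{DD} - \max(V_{thN}, |V_{thP}|)}{\min(V_{thN}, |V_{thP}|)} \qquad (9)$$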
The second way to calculate $T_{overlap}$ requires the discharge
equation of $V_1'$, which is not easy to derive, since during
$T_{overlap}$ the capacitor $C_{L1}$ is discharging into all the other
capacitors in the circuit while at the same time being charged from
the source $V_1$ as well as from other capacitors ($C_{L1}$ and
$C_{L2}$ denote $C_{PN-L-1}$ and $C_{PN-L-2}$ in Fig. 12). According
to our simulations, the discharging current from $C_{L1}$ into $C_{L2}$,
$C_{p-PN-1}$, $C_{p-PN-2}$, and $C_{int}$ is the most dominant, so we
developed the following discharge equation:

$$V_1'(t) = V_f + (V_i - V_f) \cdot \exp\left(-\frac{t}{\tau}\right) \qquad (10)$$

where $V_i=V_{DD}$, $V_f=V_{DD}/2$, and

$$\tau = 2R_{PN} \cdot (C_{PN-L-2} + C_{p-PN-1} + C_{p-PN-2}) + R_{PN} \cdot C_{int}$$

Fig. 11: SystemC-A output signals showing complemented 3D signals multiplexing issues.
To check the accuracy of Eq. 10, we calculate the theoretical
value of $V_1'$ from Eq. 10 and compare it with the value obtained
from simulation using SystemC-A. The error between the
theoretical and simulation models does not exceed 8%, which
indicates acceptable accuracy of this discharge equation.
Using Eq. 10, the time taken by $V_1'$ to reach $V_{inH-min}$
can now be determined. Setting $V_1'(t)=V_{DD}-|V_{thP}|$ with
$V_i=V_{DD}$ and $V_f=V_{DD}/2$ gives
$\exp(-t/\tau)=(V_{DD}-2|V_{thP}|)/V_{DD}$, hence

$$T\big|_{V_1'=V_{inH-min}} = T_{overlap} = \tau \cdot \ln\frac{V_{DD}}{V_{DD} - 2|V_{thP}|} \qquad (11)$$
Finally, by equating Eqs. 9 and 11, the value of $R_{dr-S}$ and
the proper size of the S driver can be determined to avoid the
concurrent operation problems.
It is worth mentioning that the above steps assume
an ideal case in which $V_1'$ reaches $V_{DD}$ during charging, whereas
we take $V_{inH-min}$ as the level sufficient to be considered
logic '1' and define $T_{d-TSVBOX}$ accordingly. If $V_1'$ only
reaches $V_{inH-min}$, the overlap period must be 0, which can only
be realized with a zero-resistance selection signal driver, i.e., an
infinitely large inverter driver, and this is not possible. To resolve
this conflict, we henceforth let $V_1'$ and $V_2'$ charge
to a value slightly higher than $V_{inH-min}$. In this
paper, we choose to charge $V_1'$ and $V_2'$ to $1.1V_{inH-min}$, which
is 10% higher than $V_{inH-min}$. Hence, the equations defining
Fig. 12: TSVBOX simplified circuit model.
$T_{d-TSVBOX}$ and $T_{overlap}$ change to

$$T_{d-TSVBOX} = \ln\frac{V_{DD}}{|V_{thP}| - 0.1(V_{DD} - |V_{thP}|)} \cdot \Big( R_{dr-TSVBOX} \cdot (C_p + C_{PN}) + (R_{dr-TSVBOX} + R_{PN}) \cdot (4C_{PN} + 2C_W + C_{TSV}) + (R_{dr-TSVBOX} + 2R_{PN}) \cdot (C_{PN} + C_L) \Big) \qquad (12)$$

and

$$T_{overlap} = \tau \cdot \ln\frac{V_{DD} - 2|V_{thP}| + 0.2(V_{DD} - |V_{thP}|)}{V_{DD} - 2|V_{thP}|} \qquad (13)$$

respectively. The delay of Eq. 12 is higher than that of Eq. 5,
since the signals must charge to a higher level, while Eq. 13
takes the place of Eq. 11.
C. Minimum duration of the selection signal
After introducing the driver design of S ($\bar{S}$) in the previous
subsection, its clocking period $T_S$ can be derived. As shown in
Fig. 10, the period $T_S/2 \le t \le T_S$ can be divided
into three smaller periods:
(1) $T_{d-S}$: the time required by S to reach $\max(V_{thN}, |V_{thP}|)$.
(2) $T_{overlap}$: the permitted overlap period of the concurrent ON
state of the TSVBOX switches.
(3) $T_{rem}$: the remaining time until $T_S$. During this period,
the $V_2'$ signal is required to reach an acceptable level, '0' or
'1'; therefore, we must select $T_{rem} \ge T_{d-TSVBOX}$.
Based on these observations, the minimum clocking period
$T_{S-min}$ for the S ($\bar{S}$) control signal can be expressed as follows:

$$T_{S-min} = 2(T_{overlap} + T_{d-S} + T_{d-TSVBOX}) \qquad (14)$$
D. Selection signal generation
In order to simplify the design procedure, the S signal should
be derived from the clock signal itself. To simplify further, $T_S$
can be selected as an even integer multiple of the clocking
period $T_{CLK}$, i.e., $T_S=2n \cdot T_{CLK}$, where n is chosen such that
the inequality $T_S \ge T_{S-min}$ is fulfilled. To select n, the
relation between $T_{S-min}$ and $T_{CLK}$ must be known first. There
are three different cases:
(1) $T_{CLK} < T_{S-min}/2$: we can choose $T_S/2 = n \cdot T_{CLK}$, n=2, 3,
etc., or in other words $T_S=2n \cdot T_{CLK}$³, and we choose n
such that $T_S/2 \ge T_{S-min}/2$.
(2) $T_{S-min}/2 \le T_{CLK} < T_{S-min}$: in this case the period
$T_{CLK}$ is sufficient to pass $V_1$ or $V_2$ (but not both) through
the TSVBOX; therefore, $T_S/2$ can be chosen to be $T_{CLK}$,
or more formally $T_S=2T_{CLK}$ (n is constant and equal to 1).
(3) $T_{CLK} \ge T_{S-min}$: in this case the period $T_{CLK}$
is sufficient to pass $V_1$ then $V_2$ (or vice versa) serially.
Therefore, we can choose $T_S=T_{CLK}$ (n is constant and equal
to 1/2).
As can be seen from the above cases, the TSVBOX degrades
the performance of the 3D NoC in cases (1) and (2), where
more than one clock cycle is needed to transfer the data per
vertical hop. In contrast, in case (3) the TSVBOX shows
the same performance as the conventional 3D NoC, because the
multiplexing operation needs only one clock cycle to transfer
both signals, just as in the conventional case.
Also, in the last case, there is no need for frequency dividers
to generate the S signal, since it can be chosen as the clock
signal itself.
Footnote 3: Such clock frequency division can be realized easily using injection-locked
frequency dividers (ILFDs). Most ILFDs have been optimized for division
by even numbers [28].
E. Communication procedure between two conventional 3D
routers
To define a communication protocol for the TSVBOX-based
3D router, the conventional procedure should be explained
first. The communication signalling between two conventional
3D routers inside the 3D NoC, shown in Fig. 13, is done using
the synchronous REQ and ACK signals to transfer a flit from
the upstream to the downstream router. This communication protocol
can be summarized as follows:
(1) The transmitting router initiates a request to the receiving
router by raising the REQ signal to ’1’. At the same time
the data flit is ready on the data bus.
(2) After at least one cycle, the initiated request will be
recognized by the input controller of the receiving router.
If there is at least one free slot in the FIFO buffer,
the packet will be read and the acknowledgement ACK
signal will be set to ’1’, announcing that the packet is
received successfully.
(3) After another cycle from setting the ACK, the trans-
mitting router will reset the REQ again to ’0’, upon
detecting the assertion of ACK. Resetting REQ is the
end of the communication operation.
[Figure 13: timing diagram of CLK (period TCLK), FLIT (upstream), REQ (upstream), and ACK (downstream), with the instant of reading the flit marked.]
Fig. 13: Synchronous communication protocol.
F. Communication procedure between two TSVBOX-based 3D
routers
The communication protocol for the TSVBOX-based 3D
routers is the same as that of the conventional
routers for horizontal hops. For vertical hops, the
procedure is changed to account for the extra cycles needed by
the TSVBOX to transfer the data. As shown in Fig. 14, a new
internal signal called OnePulse is introduced. OnePulse is
used to power gate the driving circuit of the S and S̄ signals
so as to reduce their power consumption when no
transmission is in progress. This signal is activated when the
output controller decides to send a flit to the downstream
router. Since the TSVBOX needs 2n·TCLK to transfer two
multiplexed signals successfully, the OnePulse signal lasts
for only 2n·TCLK, n=0.5, 1, 2, 3, etc., as shown in Subsection
VI-D. Since both S and S̄ are needed only during transmission,
the OnePulse signal can be used to activate or deactivate
those two overhead signals. This activation or deactivation can
be implemented as a power gating technique, as shown in Fig. 15.
[Figure 14: timing diagram of CLK (period TCLK), OnePulse (upstream), FLIT (upstream, data flit), REQ (upstream), and ACK (downstream). Annotations: "This pulse lasts for a number of cycles = EWCs+1 = 2n" (OnePulse, spanning EWCs·TCLK plus one TCLK); "REQ lasts at least for two cycles"; "Instant at which the downstream router can read the flit".]
Fig. 14: Vertical communication synchronous protocol between two TSVBOX-based 3D routers.
Now the TSVBOX vertical communication procedure can
be summarized as follows:
(1) The upstream router sets the internal OnePulse signal, which
enables both the S and S̄ signals, so that the two multiplexed
signals can pass to the data bus first.
(2) Since the downstream router takes at least one cycle to
read REQ after its initiation, REQ can be set by the
upstream router after (2n−1)·TCLK.
(3) Step 2 of the conventional procedure of Subsection VI-E.
(4) Step 3 of the conventional procedure of Subsection VI-E.
As observed, the TSVBOX-based 3D routers impose extra
waiting cycles (EWCs) equal to (2n−1), which is the number
of TCLK cycles that the router must wait before initiating
a request, and thus they may degrade the overall 3D NoC
performance. However, we will show in Section VII-D that this
performance degradation depends on the traffic pattern of
the 3D NoC and on other parameters as well.
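The vertical procedure above can be turned into a cycle-level timeline. The following Python sketch is a hedged illustration, not the routers' actual control logic: the function name and dictionary keys are hypothetical, cycle 0 is taken as the cycle in which OnePulse is set, and the downstream FIFO is assumed to have a free slot so that ACK follows REQ after exactly one cycle.

```python
def vertical_transfer_timeline(n):
    """Earliest-case cycle indices for one vertical flit transfer.

    n is the factor in T_S = 2 * n * T_CLK (0.5, 1, 2, 3, ...).
    """
    pulse_cycles = 2 * n
    if pulse_cycles < 1 or pulse_cycles != int(pulse_cycles):
        raise ValueError("2 * n must be a positive integer")
    pulse_cycles = int(pulse_cycles)
    ewcs = pulse_cycles - 1              # extra waiting cycles, 2n - 1
    return {
        "onepulse_set": 0,               # step (1)
        "onepulse_cleared": pulse_cycles,  # OnePulse lasts 2n cycles
        "req_set": ewcs,                 # step (2): after (2n - 1) cycles
        "ack_set": ewcs + 1,             # step (3): assumes a free FIFO slot
        "req_reset": ewcs + 2,           # step (4): one cycle after ACK
    }

if __name__ == "__main__":
    for n in (0.5, 1, 2):
        print(n, vertical_transfer_timeline(n))
```

For n = 0.5 the timeline collapses to the conventional handshake, with REQ raised in the same cycle as OnePulse.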
[Figure 15: the S (S̄) signal is gated by OnePulse before being delivered to the TSVBOXes.]
Fig. 15: Simple power gating technique using OnePulse
signal.
VII. TSV MULTIPLEXING SCALABILITY ANALYSIS
In the previous sections, TSV multiplexing has been studied
in detail for the 2×1 TSV multiplexing ratio. In this section, the
study is generalized to the case of NMUX×1 TSV
multiplexing, where NMUX is the TSV multiplexing ratio
(NMUX≥2). NMUX is assumed to be an arbitrary power-of-2
integer, i.e. NMUX=2, 4, 8, etc.
Fig. 16 shows the TSVBOX SystemC-A circuit modeling
for the general case of NMUX×1 multiplexing ratio and a
special case when NMUX=4. The mechanism can be consid-
ered as a generalization for the previous mechanism of 2×1
multiplexing. As shown in Fig. 16a, each multiplexed signal
passes through the TSVBOX during an exclusive combination of
the selection control signals. Since we have 2NMUX different
combinations of the S's signals, i.e. (S1·S2···Slog2NMUX),
(S̄1·S2···Slog2NMUX), ···, (S̄1·S̄2···S̄log2NMUX), each signal
from the NMUX different input signals can pass exclusively
through the TSVBOX. For example, in 4×1 multiplexing (Figs.
16b and 16c), V1 passes through the TSVBOX if all the S's
signals are '1' (S1·S2=1), which is valid only during the
first half cycle of S1, while V2 passes if S̄1·S2=1, a
combination that can be logic '1' only during the second half cycle
of S1. The same logic can be repeated for V3 and V4.
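The exclusive selection described above behaves like a one-hot decoder driven by frequency-divided signals. The Python sketch below is a behavioural illustration under stated assumptions, not the circuit of Fig. 16: the function names are hypothetical, time is divided into NMUX equal slots, S1 is the fastest signal and starts at '1', and the assignment of V3 and V4 to the remaining combinations is assumed by extending the V1 and V2 pattern.

```python
def selection_signals(slot, n_mux):
    """Return [S1, S2, ...] during a time slot (S1 toggles every slot)."""
    if n_mux < 2 or n_mux & (n_mux - 1):
        raise ValueError("n_mux must be a power of two >= 2")
    bits = n_mux.bit_length() - 1          # log2(n_mux)
    # S_j is '1' while bit (j - 1) of the slot counter is 0.
    return [1 - ((slot >> j) & 1) for j in range(bits)]

def enabled_input(s_values):
    """Decode the S values into the 1-based index of the passing input."""
    index = 0
    for j, s in enumerate(s_values):
        index |= (1 - s) << j              # a complemented S_j sets bit j
    return index + 1

if __name__ == "__main__":
    for slot in range(4):
        s = selection_signals(slot, 4)
        print("slot", slot, "S1,S2 =", s, "-> V%d passes" % enabled_input(s))
```

Because each slot yields a distinct combination, exactly one input is connected to the TSV at any time in this model.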
In the following subsections, a scalability analysis
is provided to show how higher TSV multiplexing ratios
impact the control signals (S's), the EWCs, and the TSVBOX delays.
In the final subsection, the scalability impact on performance
under different traffic patterns is analyzed.
A. Effect of multiplexing on selection control signals
As shown in Fig. 16a, the number of selection control
signals (S's) used is always log2(NMUX). However, we need
to generate NMUX signals from these log2(NMUX) signals
to drive the TSVBOX. This means that we need NMUX
extra TSVs to transfer them for each vertical bus of width NBW. This,
of course, imposes a limitation on the multiplexing ratio to
be used, especially if NBW is comparable to NMUX. For
example, if NBW=128 and NMUX=128, then no TSV saving
will be gained from the TSVBOX. Another limitation comes
from the circuit needed to generate the S's signals
(AND gates and inverters). Fortunately, this
generation circuit is not repeated in each layer; we assume it
is placed only in the first layer beside the clock dividers. We
also assume that the area of the system circuit on the first
die is much larger than the area overhead of the signal
generation circuit, whose contribution to the overall area is
hence negligible.
The frequencies of the selection control signals are related
to each other, e.g. in the 4×1 case, fS1=2fS2, which is simply
synthesized using digital frequency dividers. The selection
signal with maximum frequency (fS1) is determined in
the same way explained in Subsection VI-D, according to the
relation between TCLK and TS1−min/2 (the minimum time duration
required to transmit any of the input multiplexed signals).
B. Effect of multiplexing on EWCs
Again, the TSVBOX imposes extra delay in the form of extra
waiting cycles (EWCs), and these extra delay cycles depend
on TCLK, TS1/2, and the TSV multiplexing ratio NMUX. To
calculate EWCs in case NMUX>2, we can generalize the 2×1
case of Subsection VI-F. Depending on the relation between
TCLK and TS1−min/2, the EWCs can be shown to be
(from Fig. 14):
$$ EWCs = \frac{N_{MUX} \cdot \frac{T_{S1}}{2}}{T_{CLK}} - 1 $$
and since TS1/2 = n·TCLK, therefore
$$ EWCs = n \cdot N_{MUX} - 1 \tag{15} $$
where
$$ n = \begin{cases} 2, 3, \text{etc.} & : T_{CLK} < \frac{T_{S1-min}}{2} \\ 1 & : \frac{T_{S1-min}}{2} \le T_{CLK} < T_{S1-min} \\ \frac{1}{2} & : T_{CLK} \ge T_{S1-min} \end{cases} $$
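Eq. 15 can be evaluated directly for the multiplexing ratios considered in this paper. The following Python sketch uses a hypothetical helper `ewcs` and assumes n = 2 for the case TCLK = TS1−min/4 (the smallest admissible integer); the dictionary labels are illustrative.

```python
def ewcs(n, n_mux):
    """Extra waiting cycles per vertical hop, Eq. 15: n * N_MUX - 1."""
    cycles = n * n_mux - 1
    if cycles < 0 or cycles != int(cycles):
        raise ValueError("n * n_mux must be a positive integer")
    return int(cycles)

# n for three clock periods; n = 2 for T_CLK = T_S1-min / 4 is an assumption
# (smallest integer satisfying the first case of Eq. 15).
N_FOR_CLOCK = {
    "T_CLK = T_S1-min": 0.5,
    "T_CLK = T_S1-min / 2": 1,
    "T_CLK = T_S1-min / 4": 2,
}

EWCS_TABLE = {
    label: [ewcs(n, n_mux) for n_mux in (2, 4, 16)]
    for label, n in N_FOR_CLOCK.items()
}

if __name__ == "__main__":
    for label, row in EWCS_TABLE.items():
        print(label, row)
```

Under these assumptions the computed values agree with the entries listed in Table VIII.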
C. Effect of multiplexing ratio on TSVBOX delays
[Figure 16: circuit schematics of the TSVBOX models, showing the input inverter drivers (Rdr−TSVBOX), the transmission gates (RonP, RonN, CP, CN), the wire and TSV parasitics (RW, CW, RTSV, CTSV, Cp), and the output 1x-inverter drivers with load CL, for (a) inputs V1 to VNMUX and (b) inputs V1 to V4 with outputs V1′ to V4′; (c) timing of OnePulse, S1 (period TS1), and S2 (period TS2), with V1, V2, V3, and V4 passing in successive slots.]
Fig. 16: TSVBOX circuit models for (a) the general case
NMUX×1 and (b) the special case of the 4×1 TSV multiplexing
ratio. (c) Selection control signals required for 4×1 TSV
multiplexing. V1 passes when S1·S2=1, V2 passes when
S̄1·S2=1, etc.
The TSVBOX delay increases linearly with the
multiplexing ratio NMUX. The delay of any multiplexed signal passing
through the TSVBOX can be calculated as
$$ T_{d-TSVBOX} = \ln\left(\frac{V_{DD}}{|V_{thP}| - 0.1\,(V_{DD} - |V_{thP}|)}\right) \cdot \Big( R_{dr-TSVBOX} \cdot (C_p + C_{PN}) + (R_{dr-TSVBOX} + R_{PN}) \cdot (2 N_{MUX} \cdot C_{PN} + 2 C_W + C_{TSV}) + (R_{dr-TSVBOX} + 2 R_{PN}) \cdot (C_{PN} + C_L) \Big) \tag{16} $$
where we considered the overlap penalty explained in Subsection
VI-B. As observed, Eq. 12 is a special case of Eq. 16
when NMUX=2.
The multiplexing ratio affects the delays of the S's signals as
well, but this time the effect is positive; in other words,
the higher the multiplexing ratio, the smaller the delays of the
S's signals. For example, for the case of NMUX=2, each transmission
gate contributes 2Cg, hence the total loading contribution
for either S or S̄ is 4·(NBW/NMUX)·Cg = 2NBW·Cg = 128Cg,
as depicted in Eq. 6. As another example, for the case of
NMUX=4, each TSVBOX contributes 2Cg to the load of each control signal
(S1·S2, S̄1·S2, etc.) as well. However, since
the total number of TSVBOXes is reduced to 16, i.e.
NBW/NMUX = 64/4 = 16, the total loading contribution would be 64Cg
for each control signal. In general, based on the previous
analysis, Eq. 6 can be rewritten as
$$ T_{d-S} = \ln\left(\frac{V_{DD}}{V_{DD} - \max(V_{thN}, |V_{thP}|)}\right) \cdot R_{dr-S} \cdot \left(C_p + 2 C_W + C_{TSV} + 4 \frac{N_{BW}}{N_{MUX}} \cdot C_g\right) \tag{17} $$
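The gate-load term of Eq. 17 can be checked against the two examples above. This is a minimal Python sketch with a hypothetical helper name; it returns the load as a multiple of Cg and assumes NBW=64, the bus width implied by the 128Cg example.

```python
def selection_gate_load_in_cg(n_bw, n_mux):
    """Gate load on one selection control signal, in units of C_g.

    Implements the term 4 * (N_BW / N_MUX) of Eq. 17.
    """
    if n_bw % n_mux != 0:
        raise ValueError("n_mux must divide n_bw")
    return 4 * (n_bw // n_mux)

if __name__ == "__main__":
    for n_mux in (2, 4, 16):
        print(n_mux, selection_gate_load_in_cg(64, n_mux), "x Cg")
```

The load halves each time the multiplexing ratio doubles, which is why the delay of the S's signals decreases with NMUX.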
D. Effect of multiplexing ratio on performance under
different traffic patterns
In this subsection, we analyze how the multiplexing ratio scales
the performance of the TSVBOX-based 3D NoC versus the
conventional one. We choose three different traffic patterns:
Matrix Transpose (see footnote 4), Uniform, and Hotspot. The packet
injection process of each traffic flow is chosen to be a Poisson
random process, where the time interval between successive
injections is an exponential random variable [29].
A Poisson-distributed injection rate is adopted in this study
because it successfully characterizes the performance of
multiprocessor applications [30, 31]. The transmitter waits for
Tinterval = −λ·ln(U) between two successive packet transmissions,
where Tinterval is the exponential random time interval
between two successive transmitted packets, λ is the reciprocal
of the average injection rate of the process, i.e. the average
waiting time between two successive transmissions, and U is
a uniform random variable between '0' and '1' [32]. Table III
states all the definitions of the terms used in this analysis.
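The inter-injection interval Tinterval = −λ·ln(U) is inverse-transform sampling of an exponential distribution. The Python sketch below is illustrative only: the function name, the seed, and the sample values are assumptions, and U is drawn from (0, 1] to avoid evaluating ln(0).

```python
import math
import random

def injection_intervals(mean_interval, count, seed=0):
    """Draw `count` exponential waiting times with the given mean (lambda)."""
    if mean_interval <= 0:
        raise ValueError("mean_interval must be positive")
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    return [-mean_interval * math.log(1.0 - rng.random()) for _ in range(count)]

if __name__ == "__main__":
    # Hypothetical example: an average of 20 cycles between injections,
    # i.e. an injection rate of 0.05 packet per cycle.
    samples = injection_intervals(20.0, 100000, seed=1)
    print(sum(samples) / len(samples))
```

The sample mean approaches λ as the number of samples grows, which is the average waiting time between two successive transmissions.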
The following three equations (see footnote 5) define the average extra
waiting cycles EWCsavg per vertical hop faced by packets
traversing vertically between 3D stack layers for Transpose,
Uniform, and Hotspot traffic, respectively.
Footnote 4: Henceforth, we call it Transpose for simplicity.
Footnote 5: The proofs of these three equations are found in Appendix A.
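$$ EWCs_{avg-Transpose} = 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \left[\frac{N^2}{4}\right] \tag{18} $$
$$ EWCs_{avg-Uniform} = 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \left[\frac{1}{N} \cdot \left(\sum_{i=0}^{\frac{N}{2}-1} \frac{(N-i) \cdot (N-i-1)}{2} + \sum_{i=0}^{\frac{N}{2}-1} i \cdot \left(\frac{N}{2} - i\right)\right)\right] \tag{19} $$
$$ EWCs_{avg-Hotspot} = 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \left[\frac{(1-h)}{N} \cdot \left(\sum_{i=0}^{\frac{N}{2}-1} \frac{(N-i) \cdot (N-i-1)}{2} + \sum_{i=0}^{\frac{N}{2}-1} i \cdot \left(\frac{N}{2} - i\right)\right)\right] \tag{20} $$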
It can be seen that the difference between the above three
equations lies in the terms between brackets "[]":
• In the Transpose case, $I = \frac{N^2}{4}$
• In the Uniform case, $II = \frac{1}{N} \cdot \left[\sum_{i=0}^{\frac{N}{2}-1} \frac{(N-i) \cdot (N-i-1)}{2} + \sum_{i=0}^{\frac{N}{2}-1} i \cdot \left(\frac{N}{2} - i\right)\right]$
• In the Hotspot case, $III = \frac{(1-h)}{N} \cdot \left[\sum_{i=0}^{\frac{N}{2}-1} \frac{(N-i) \cdot (N-i-1)}{2} + \sum_{i=0}^{\frac{N}{2}-1} i \cdot \left(\frac{N}{2} - i\right)\right]$
Fig. 17 displays the variation of the previous three terms versus
N when h=0.5. We can notice that I>II>III for all values
of N, which means that Hotspot traffic is expected to show
better performance than Uniform and Transpose traffic, and
Uniform traffic in turn is expected to outperform Transpose.
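The ordering of the three bracketed terms can be checked numerically. The Python sketch below uses hypothetical function names and restricts N to even values, for which the upper summation limit N/2−1 is an integer; how odd N is handled in Fig. 17 is not stated, so it is not modelled here.

```python
def term_transpose(n_layers):
    """Term I of Eq. 18: N^2 / 4."""
    return n_layers ** 2 / 4

def term_uniform(n_layers):
    """Term II of Eq. 19 (even N only)."""
    if n_layers < 2 or n_layers % 2:
        raise ValueError("this sketch supports even N >= 2 only")
    half = n_layers // 2                      # i runs from 0 to N/2 - 1
    first = sum((n_layers - i) * (n_layers - i - 1) / 2 for i in range(half))
    second = sum(i * (n_layers / 2 - i) for i in range(half))
    return (first + second) / n_layers

def term_hotspot(n_layers, h):
    """Term III of Eq. 20: (1 - h) times term II."""
    return (1 - h) * term_uniform(n_layers)

if __name__ == "__main__":
    for n_layers in (2, 4, 6, 8, 10):
        print(n_layers, term_transpose(n_layers),
              term_uniform(n_layers), term_hotspot(n_layers, 0.5))
```

For example, N=4 gives I=4, II=2.5, and III=1.25 when h=0.5.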
VIII. FABRICATION YIELD AND COST ANALYSIS
In this section, fabrication yield and cost are analyzed based
on the analysis of [1] for the W2W bonding process. The design
methodology proposed in this paper introduces
extra hardware that may affect the overall system
area and hence the overall yield and cost. Hence, the analysis
of this section is crucial for the overall evaluation of the TSV
multiplexing technique. In the following subsections, we briefly
overview the normalized yield models derived in [1]. In the
following analysis we use the terms defined in Table IV.
TABLE III: Definitions of all terms used in the performance
analysis of the TSVBOX-based 3D NoC.
| Parameter | Definition |
|---|---|
| N | The number of 3D stack layers |
| NPKT | The number of packets sent by each processing node |
| NMUX | Multiplexing ratio of the TSVBOX |
| n | n=0.5, 1, 2, 3, 4, etc., dependent on the relation between TCLK and TS1−min/2 |
| h | 0≤h≤1. In Hotspot traffic, h represents the portion of the total transmitted packets by any node that is directed to the hotspot node in the 3D NoC |
| EWCsavg | The average EWCs exposed by the TSVBOX-based 3D NoC per one vertical hop |
| XSIZE | The number of nodes in the X dimension of the 3D NoC |
| YSIZE | The number of nodes in the Y dimension of the 3D NoC |
| ZSIZE | The number of nodes in the Z dimension of the 3D NoC |
[Figure 17: plot of the terms I, II, and III (vertical axis, 0 to 25) versus the number of layers N (horizontal axis, 2 to 10).]
Fig. 17: The relation between the terms I, II, and III.
TABLE IV: Definitions of all terms used in Eqs. (21-31).
| Parameter | Definition |
|---|---|
| Adie\|before multiplexing | Die area before TSV multiplexing |
| Adie\|after multiplexing | Die area after TSV multiplexing |
| Arouters,cores | The aggregate area of NoC routers and cores per one die |
| M | Mesh size, e.g. M=4×4, 8×8, etc. |
| DTSV | The TSV diameter |
| ATSV | The cross-sectional area of one TSV |
| AMUX | The multiplexer area of the TSVBOX |
| Adr−S | The S signal driver area |
| ATG | The transmission gate area used for clock gating of the S's control signals |
| fTSV | The probability of fabricating a non faulty TSV |
| NTSV | The number of TSVs between two layers in a 3D stack |
| α | A constant that depends upon the complexity of the manufacturing process |
| Do | Average density of defects per die |
[Figure 18: (a) Td−TSVBOX, Td−TSVBOX*, and Td−Conv versus NMUX (0 to 300) for CTSV=15 fF, vertical axis scaled by 10^−9; (b) the same curves for CTSV=500 fF, vertical axis scaled by 10^−8; (c) selection control signal delay Td−S versus NMUX for CTSV=15 fF and CTSV=500 fF, vertical axis scaled by 10^−9.]
Fig. 18: TSVBOX vs. conventional 3D interconnect delays for
(a) CTSV=15 fF and (b) CTSV=500 fF. Td−TSVBOX*
denotes the TSVBOX delay of Eq. 5 before adding
the 10% increase to VinH−min. In (c), the change of the
selection control signal delay Td−S is depicted for CTSV=15 fF
and CTSV=500 fF.
A. Fabrication yield analysis
As stated in [1], there are two main components that affect
the overall yield of the W2W bonding process:
• Normalized stacking yield: According to [1], the normalized
stacking yield is equivalent to the normalized TSV
yield (YTSV−norm), which is the probability that all TSVs
are non-faulty:
$$ Y_{S,W2W-norm} = Y_{TSV-norm} = \frac{(1 - f_{TSV})^{(N_{TSV}|_{\text{after multiplexing}})}}{(1 - f_{TSV})^{(N_{TSV}|_{\text{before multiplexing}})}} \tag{21} $$
where NTSV|before multiplexing and
NTSV|after multiplexing can be calculated as follows:
$$ N_{TSV}|_{\text{before multiplexing}} = 2M \cdot (N_{BW} + 2) \tag{22} $$
$$ N_{TSV}|_{\text{after multiplexing}} = 2M \cdot \left(\frac{N_{BW}}{N_{MUX}} + 2 + N_{MUX}\right) \tag{23} $$
Of course, we gain a TSV yield improvement iff
NTSV|after multiplexing < NTSV|before multiplexing. If
we assume that both NMUX and NBW are power-of-2
integers, then we can show that the condition that makes
NTSV|after multiplexing < NTSV|before multiplexing is
simply NMUX ≤ NBW/2. For example, if NBW=256 and
NMUX=128, then NTSV|before multiplexing = 2M×(258)
and NTSV|after multiplexing = 2M×(132). Now,
for the same NBW, if we assume NMUX=256,
NTSV|after multiplexing would be 2M×(259), which is
clearly larger than the 2M×(258) TSVs required without
TSV multiplexing.
• Die yield: Recalling [1], the normalized die yield
Ydie−norm can be represented as
$$ Y_{die-norm} = \left(\frac{\alpha + D_o \cdot A_{die}|_{\text{before multiplexing}}}{\alpha + D_o \cdot A_{die}|_{\text{after multiplexing}}}\right)^{\alpha} \tag{24} $$
As observed, Ydie−norm in turn depends on the die
area before and after TSV multiplexing; therefore it is
affected by the proposed design methodology described in
this paper. Regarding area, the proposed design methodology
may affect the sizes of the drivers, especially for the S
signal. It also requires special TSVs to transfer the selection
control signals between layers, which adds extra
area overhead for those extra TSVs. All these factors
affect the die area and in turn the die yield. Similar to the
analysis of [1], the die area depends on the die location,
and the uppermost and bottommost dies have different areas than
the intermediate dies. The die area of the uppermost and
bottommost dies (die1, dieN) before multiplexing can be
estimated by
$$ A_{die_{1,N}}|_{\text{before multiplexing}} = A_{routers,cores} + 2M \cdot (N_{BW} + 2) \cdot A_{TSV} \tag{25} $$
and for intermediate dies (die2, die3, ..., dieN−1)
$$ A_{die_{2,3,...,(N-1)}}|_{\text{before multiplexing}} = A_{routers,cores} + 4M \cdot (N_{BW} + 2) \cdot A_{TSV} \tag{26} $$
A similar procedure can be followed for the TSVBOX-based
3D NoC case; therefore, for die1 and dieN
$$ A_{die_{1,N}}|_{\text{after multiplexing}} = A_{routers,cores} + 2M \cdot \left(\left(\frac{N_{BW}}{N_{MUX}} + 2 + N_{MUX}\right) \cdot A_{TSV} + N_{BW} \cdot A_{TG} + N_{MUX} \cdot (A_{dr-S} + A_{TG})\right) \tag{27} $$
and for intermediate dies (die2, die3, ..., dieN−1)
$$ A_{die_{2,3,...,(N-1)}}|_{\text{after multiplexing}} = A_{routers,cores} + 4M \cdot \left(\left(\frac{N_{BW}}{N_{MUX}} + 2 + N_{MUX}\right) \cdot A_{TSV} + N_{BW} \cdot A_{TG} + N_{MUX} \cdot (A_{dr-S} + A_{TG})\right) \tag{28} $$
where the TSV cross-sectional area ATSV can be calculated
based on the value of the TSV diameter DTSV:
$$ A_{TSV} = \pi \left(\frac{D_{TSV}}{2}\right)^2 \tag{29} $$
• Overall yield: Finally, the normalized overall yield
YW2W−norm is expressed according to [1] as:
$$ Y_{W2W-norm} = (Y_{die-norm-top,bottom})^2 \cdot (Y_{die-norm-intermediate})^{(N-2)} \cdot (Y_{TSV-norm})^{N-1} \tag{30} $$
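The TSV counts of Eqs. 22 and 23 and the stacking yield of Eq. 21 can be evaluated with a few lines of Python. The sketch below is illustrative: the function names are hypothetical, M is taken as the number of mesh nodes per layer (16 for a 4×4 mesh), and f_tsv is treated as the per-TSV fault probability, as the form of Eq. 21 implies; the value 1e-5 is an arbitrary placeholder.

```python
def tsv_count_before(mesh_nodes, n_bw):
    """Eq. 22: TSVs between two layers without multiplexing."""
    return 2 * mesh_nodes * (n_bw + 2)

def tsv_count_after(mesh_nodes, n_bw, n_mux):
    """Eq. 23: TSVs between two layers with N_MUX x 1 multiplexing."""
    if n_bw % n_mux != 0:
        raise ValueError("n_mux must divide n_bw")
    return 2 * mesh_nodes * (n_bw // n_mux + 2 + n_mux)

def tsv_yield_norm(f_tsv, mesh_nodes, n_bw, n_mux):
    """Eq. 21, assuming f_tsv is the per-TSV fault probability."""
    saved = (tsv_count_before(mesh_nodes, n_bw)
             - tsv_count_after(mesh_nodes, n_bw, n_mux))
    # (1 - f)^after / (1 - f)^before = (1 - f)^(-saved)
    return (1.0 - f_tsv) ** (-saved)

if __name__ == "__main__":
    m = 16  # 4x4 mesh
    print(tsv_count_before(m, 256))        # 2M x 258
    print(tsv_count_after(m, 256, 128))    # 2M x 132
    print(tsv_count_after(m, 256, 256))    # 2M x 259
    print(tsv_yield_norm(1e-5, m, 256, 128))
```

A normalized yield above 1 indicates that multiplexing reduces the TSV count, and a value below 1 indicates the opposite.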
B. Fabrication cost analysis
As derived in [1], the overall W2W normalized cost
CW2W−norm can be considered as the reciprocal of the overall
W2W normalized yield:
$$ C_{W2W-norm} = \frac{C_{W2W}|_{\text{after multiplexing}}}{C_{W2W}|_{\text{before multiplexing}}} = \frac{1}{Y_{W2W-norm}} \tag{31} $$
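Eqs. 24, 30, and 31 chain together as sketched below in Python. The function names are hypothetical, and the numeric inputs in the usage example (α, Do, die areas, and the TSV yield) are arbitrary placeholders rather than values from this study.

```python
def die_yield_norm(alpha, d_o, area_before, area_after):
    """Eq. 24: normalized die yield from the die areas before/after multiplexing."""
    return ((alpha + d_o * area_before) / (alpha + d_o * area_after)) ** alpha

def overall_yield_norm(y_die_top_bottom, y_die_intermediate, y_tsv, n_layers):
    """Eq. 30: normalized overall W2W yield for an N-layer stack."""
    if n_layers < 2:
        raise ValueError("a 3D stack needs at least two layers")
    return (y_die_top_bottom ** 2
            * y_die_intermediate ** (n_layers - 2)
            * y_tsv ** (n_layers - 1))

def cost_norm(y_w2w_norm):
    """Eq. 31: normalized cost is the reciprocal of the normalized yield."""
    return 1.0 / y_w2w_norm

if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    y_tb = die_yield_norm(alpha=2.0, d_o=0.5, area_before=1.00, area_after=0.98)
    y_mid = die_yield_norm(alpha=2.0, d_o=0.5, area_before=1.00, area_after=0.96)
    y_all = overall_yield_norm(y_tb, y_mid, y_tsv=1.04, n_layers=4)
    print(y_all, cost_norm(y_all))
```

A normalized cost below 1 means the multiplexed stack is expected to be cheaper than the conventional one under the chosen inputs.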
IX. SIMULATION PLATFORM AND RESULTS
A. 3D signal path delays
For the sake of a fair comparison between the conventional
and TSVBOX-based 3D NoCs, both are operated and designed
based on the same input frequency and the same data driver
size (KN|before multiplexing = KN|after multiplexing).
Figs. (18a,18b) depict the change in Td−TSVBOX vs. the
multiplexing ratio NMUX under two different values of CTSV.
While TConv remains constant, Td−TSVBOX increases linearly
with increasing NMUX, and this is applicable for both
Eqs. (5,12) of the Td−TSVBOX delays. Fortunately, the linear
increase is very small, because of the small slope:
$$ \text{Slope} = 2 \ln\left(\frac{V_{DD}}{V_{thp}}\right) \cdot C_{PN} \cdot (R_{dr-TSVBOX} + R_{PN}) \approx 0.011178\ \text{nsec per unit change in } N_{MUX} \tag{32} $$
An important observation here is that the slope is independent of the
TSV technology. According to Figs. (18a,18b), in the case of 500
fF TSV capacitance, the increase in Td−TSVBOX delay when
NMUX changes from 2 to 256 is 3.35 nsec, while for 15 fF
TSV capacitance it is exactly 3.35 nsec as well.
These identical values support the theoretical expectation based
on Eq. 32.
Figs. (18a,18b) also show that the TSVBOX has a
delay impact due to the extra circuits added for multiplexing.
However, when CTSV is large, e.g. CTSV=500
fF, the ratio Td−TSVBOX/TConv increases slowly with
NMUX, which reduces the impact of NMUX on Td−TSVBOX.
For example, in the case of CTSV=500 fF, the
ratio Td−TSVBOX/TConv is 1.6631 for NMUX=2 and 1.9150 for NMUX=256.
However, for the case of CTSV=15 fF, the ratios are 1.7108
and 3.7991 (more than double 1.7108), respectively.
Fig. 18c displays the change in Td−S versus NMUX and, as
stated before in Subsection VII-C, the selection control signals
show less delay for higher multiplexing ratios. However, the
reduction is very small, since the term (4·(NBW/NMUX)·Cg) of Eq.
17 is usually masked by the large values of the (CTSV, 2CW)
capacitors and because of the very small value of the gate
capacitance Cg itself.
TABLE VI: Theoretical and simulational 3D signals' delays for conventional and TSVBOX paths for 65 nm technology for
different TSV capacitances.
| 3D signal delay | Unit | 15 fF: Theoretical | 15 fF: Simulational | 15 fF: abs. error | 500 fF: Theoretical | 500 fF: Simulational | 500 fF: abs. error |
|---|---|---|---|---|---|---|---|
| Td−Conv | nsec | 1.38 | 1.44 | 4.166% | 11.27 | 11.36 | 0.968% |
| Td−TSVBOX | nsec | 2.33 | 2.31 | 0.87% | 18.75 | 18.74 | 0.053% |
| Td−S | nsec | 0.833 | 0.9 | 7.44% | 6.198 | 6.42 | 3.46% |
To check the accuracy of the proposed delay models, we
compared the theoretical delays calculated from Eqs. (4,12,6)
with the SystemC-A simulation delays. As depicted in Table
V, it is assumed that all the drivers are 1x-inverters (KN=1,
KP=2), thus all the driver resistances are at their
maximum values. As stated in Table VI, the error between the
theoretical and the simulated delay results does not exceed
7.5%, which indicates the acceptable accuracy of the Elmore delay
model.
B. Performance comparison under synthetic traffic patterns
The simulation results shown in Figs. (19,20,21) indicate
the average delay and throughput comparisons of the two
implemented 4×4×4 3D NoCs: the TSVBOX-based and the
conventional ones. The average delay of a packet is defined as
the total cycles taken by the packet flits to cross the network
towards the destination node. That delay spans from the creation
of the first flit (head flit) of the packet to when its last flit
(tail flit) is ejected at the destination (assuming immediate
ejection), including source buffer queuing time in cycles [33].
The average throughput is defined as the average
ejection rate of the packets at their destination nodes. We set
the simulation warm-up period to 2000 cycles, during which we
avoid calculating results until the network gets congested [14].
Thereafter, similar to the methodology in [33], the simulation
is run with 32,000 packets (500 packets injected from each
node), and the simulation continues at the prescribed packet
injection rate until these packets have all been received; then
their average delay and throughput are calculated.
Based on TS−min, we choose to run the simulations for
three different clock periods to experience the effect of
TSV multiplexing in different situations: TCLK=TS−min,
TCLK=TS−min/2, and TCLK=TS−min/4. We choose to calculate
TS−min for CTSV=500 fF, since it is an upper bound
for the TSVBOX delay degradation. However, the conclusions
derived later from Figs. (19,20,21) do not change
for the case of CTSV=15 fF. Since Td−TSVBOX depends on the
multiplexing ratio, Table VII displays the values of
Td−TSVBOX and TS−min versus NMUX.
TABLE V: Simulation setup for the sizes of the inverter drivers
and their equivalent ON resistances.
| Design parameter | Unit | 65 nm, 15 fF | 65 nm, 500 fF |
|---|---|---|---|
| (KN−Conv, KP−Conv) | - | (1,2) | (1,2) |
| (KN−TSVBOX, KP−TSVBOX) | - | (1,2) | (1,2) |
| (KN−S, KP−S) | - | (1,2) | (1,2) |
| Rdr−Conv | kΩ | 21.654 | 21.654 |
| Rdr−TSVBOX | kΩ | 21.654 | 21.654 |
| Rdr−S | kΩ | 21.654 | 21.654 |
TABLE VII: Different delays for different multiplexing ratios.
| Delays | 2×1 | 4×1 | 16×1 |
|---|---|---|---|
| Td−TSVBOX | 19.06 | 19.08 | 19.22 |
| TS−min | 24.40 | 24.42 | 24.55 |
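The linear dependence of Td−TSVBOX on NMUX can be cross-checked against Table VII. The Python sketch below is a consistency check under stated assumptions: the Table VII entries are taken to be in nsec, the 2×1 entry is used as the reference point, and the slope of Eq. 32 is applied unchanged; the function and constant names are hypothetical.

```python
SLOPE_NS_PER_UNIT = 0.011178   # slope of Eq. 32, nsec per unit change in N_MUX
TD_TSVBOX_2X1_NS = 19.06       # Table VII entry for 2x1 (assumed to be in nsec)

def predicted_td_tsvbox_ns(n_mux):
    """Linear extrapolation of the TSVBOX delay from the 2x1 entry."""
    return TD_TSVBOX_2X1_NS + SLOPE_NS_PER_UNIT * (n_mux - 2)

if __name__ == "__main__":
    for n_mux, tabulated in ((4, 19.08), (16, 19.22)):
        predicted = predicted_td_tsvbox_ns(n_mux)
        print(n_mux, round(predicted, 3), tabulated)
```

With these assumptions the extrapolated delays for 4×1 and 16×1 round to the tabulated values.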
As discussed in Subsection VII-B, depending on the relation
between TCLK and TS−min, TSVBOX adds EWCs, which are
calculated using Eq. 15 and depicted in Table VIII for different
multiplexing ratios.
What has been proven analytically in Subsection VII-D can
be deduced clearly from Figs. (19,20,21). We proved
in Subsection VII-D that the degradation in performance due to
the EWCs of the design methodology is application dependent,
and we showed analytically that Hotspot traffic gives closer
performance to the conventional case than the other traffic patterns.
We also showed that the performance under Transpose traffic
is the worst compared with the other patterns, Hotspot and Uniform.
Those conclusions are clearly visible in all performance
figures.
As expected, the smaller TCLK is compared to TS−min, the larger the
degradation in the performance metrics. For example,
when TCLK=TS−min in Figs. 19c and 19f, the saturation point
occurs at an injection rate (IR) ≈0.01 packet per cycle for 16×1
multiplexing, while for the conventional case it is around
≈0.05 packet per cycle. This situation is repeated in all cases,
but it is lessened or worsened according to the traffic pattern
used or the relation between TCLK and TS−min.
In Figs. (19a,19d,20a,20d,21a,21d), we observe that the
TSVBOX does not degrade the performance for 2×1
multiplexing. This strengthens the case for the multiplexing technique, since it
can improve the yield without degrading the performance.
In summary, though the TSVBOX increases the delay of
the 3D signals, depending on the application and the relation
between TCLK and TS−min, the TSVBOX-based 3D NoC can
mitigate the performance degradation and show performance
comparable to the conventional 3D NoC even for high
multiplexing ratios.
(a) TCLK=TS−min (b) TCLK=0.5TS−min (c) TCLK=0.25TS−min
(d) TCLK=TS−min (e) TCLK=0.5TS−min (f) TCLK=0.25TS−min
Fig. 19: Average delay and throughput under Transpose traffic pattern.
(a) TCLK=TS−min (b) TCLK=0.5TS−min (c) TCLK=0.25TS−min
(d) TCLK=TS−min (e) TCLK=0.5TS−min (f) TCLK=0.25TS−min
Fig. 20: Average delay and throughput under Uniform traffic pattern.
(a) TCLK=TS−min (b) TCLK=0.5TS−min (c) TCLK=0.25TS−min
(d) TCLK=TS−min (e) TCLK=0.5TS−min (f) TCLK=0.25TS−min
Fig. 21: Average delay and throughput under Hotspot traffic pattern.
TABLE VIII: EWCs under different clock periods and multiplexing ratios.
| TCLK | EWCs (2×1) | EWCs (4×1) | EWCs (16×1) |
|---|---|---|---|
| TS−min | 0 | 1 | 7 |
| TS−min/2 | 1 | 3 | 15 |
| TS−min/4 | 3 | 7 | 31 |
C. Performance comparison using real Benchmark traffic
In this subsection, another performance comparison is
shown, this time under real benchmark traffic of the
well-known dVOPD video application [37]-[40], with the task
graph shown in Fig. 22. Since mapping the tasks of the
task graph to the NoC cores is an NP-hard problem [37]-[40],
we model this mapping problem using the MiniZinc
discrete optimization modeling language [42]-[43]. The model
finds the mapping with minimum communication cost,
where the communication cost is defined as the sum, over each
pair of tasks (vertices) in the task graph, of the communication
bandwidth between them (BWij) multiplied by the number of hops
between them in the NoC (Hij). Thus the objective
function can be represented as
$$ \text{minimize} \sum_{i} \sum_{j} BW_{ij} \cdot H_{ij} \tag{33} $$
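The objective of Eq. 33 can be evaluated for a given mapping as sketched below in Python. This is not the MiniZinc model used in the study: the function names, the task names, the bandwidth values, and the coordinates are hypothetical, and the hop count Hij is assumed to be the Manhattan distance between the mapped nodes of a 3D mesh.

```python
def hops(node_a, node_b):
    """Hop count between two (x, y, z) nodes, assumed to be Manhattan distance."""
    return sum(abs(p - q) for p, q in zip(node_a, node_b))

def communication_cost(bandwidth, placement):
    """Eq. 33: sum of BW_ij * H_ij over all communicating task pairs.

    bandwidth maps (task_i, task_j) to a bandwidth value;
    placement maps each task to its (x, y, z) node.
    """
    return sum(bw * hops(placement[i], placement[j])
               for (i, j), bw in bandwidth.items())

if __name__ == "__main__":
    # Hypothetical three-task chain, bandwidths in MB/s.
    bandwidth = {("t1", "t2"): 70, ("t2", "t3"): 362}
    compact = {"t1": (0, 0, 0), "t2": (1, 0, 0), "t3": (1, 0, 1)}
    spread = {"t1": (0, 0, 0), "t2": (1, 0, 0), "t3": (3, 3, 1)}
    print(communication_cost(bandwidth, compact))
    print(communication_cost(bandwidth, spread))
```

A mapping search, whatever solver performs it, would compare candidate placements by this cost and keep the smallest.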
The difference in performance between the TSVBOX-based
and conventional 3D NoCs arises from the extra delay that the
TSVBOX may add to the flits going vertically because of
the extra waiting cycles. Our simulation platform
introduces a delay of 9 cycles per horizontal or vertical hop
for the conventional 3D NoC, while for the TSVBOX-based 3D NoC it
gives the same horizontal hop delay of 9 cycles but 9+EWCs cycles
per vertical hop. A new metric, the Vertical Hop Delay (VHD), is
introduced to account for this vertical delay.
VHD is constant and equal to 9 for the conventional 3D NoC, while it is
9+EWCs for the TSVBOX-based 3D NoC.
Conclusions similar to those of the previous
subsection can be drawn from Figs. 23a and 23b: again, the
performance of the TSVBOX depends on the EWCs it
introduces. For example, if VHD=9 (EWCs=0), we get performance
similar to the conventional case without multiplexing,
and the performance then degrades with the number
of EWCs added.
[Figure 22: dVOPD task graph with vertices numbered 1 to 32 and a bandwidth label on each edge.]
Fig. 22: dVOPD communication task graph with the communication
bandwidth between each two tasks (vertices) stated in MB/s
on each edge [41].
D. Performance comparison with TSV serialization technique
Serialization is the best-known technique to reduce TSVs
in order to increase the yield and reduce the fabrication cost. In
this subsection, we study the performance of the TSVBOX
against the serialization technique. The common impact of
both techniques is that they add extra delay to the 3D signals
passing through the vertical (interlayer) interconnects. The EWCs
delay added by serialization is constant and equal to NSER+2
cycles, where NSER is the serialization ratio, e.g. for 2x1
serialization NSER is 2. In contrast, the TSVBOX extra delay
cycles depend on the frequency of operation and are given
by Eq. 15. In Eq. 15, the parameter n depends on the
relation between the TSVBOX delay and the operational clock.
As a result, the EWCs of the TSVBOX can
be fewer than those of serialization, and the TSVBOX
can show better performance than the serialization technique.
Another advantage of the TSVBOX is that it performs better
as the technology scales, because we expect lower TSVBOX
delay and thus lower n. The latter advantage gives the TSVBOX
more opportunity to be the best candidate to
reduce the TSVs for future 3D chips.
Fig. 24 clarifies those observations regarding the performance
difference between the serialization and TSVBOX techniques.
For example, in Figs. 24a and 24d for 2x1
multiplexing/serialization, the serialization technique always adds a
4-cycle delay (EWCs=4), while the TSVBOX EWCs range
between 0 and 3 cycles (refer to Table VIII), so in all those cases
the TSVBOX outperforms the serialization technique. For the
other examples in Figs. (24b,24e,24c,24f), depending on
the EWCs that each technique adds, the TSVBOX outperforms
serialization in some situations and vice versa. For example, for
4x1 multiplexing/serialization, the TSVBOX outperforms
serialization when the EWCs of the TSVBOX are fewer than 6, which
is the EWCs added by serialization, and the same conclusion can
be drawn for 16x1 multiplexing/serialization.
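The comparison above reduces to comparing two cycle counts. The following Python sketch uses hypothetical function names and considers only the three clock settings used in this study (n = 0.5, 1, and 2); it compares EWCs only and does not model any other performance effect.

```python
def serialization_ewcs(n_ser):
    """Extra cycles added by N_SER x 1 serialization: N_SER + 2."""
    return n_ser + 2

def tsvbox_ewcs(n, n_mux):
    """Extra waiting cycles of the TSVBOX, Eq. 15: n * N_MUX - 1."""
    return int(n * n_mux - 1)

def tsvbox_has_fewer_ewcs(n, ratio):
    """True when the TSVBOX adds fewer extra cycles than serialization."""
    return tsvbox_ewcs(n, ratio) < serialization_ewcs(ratio)

if __name__ == "__main__":
    for ratio in (2, 4, 16):
        for n in (0.5, 1, 2):
            print(ratio, n, tsvbox_ewcs(n, ratio),
                  serialization_ewcs(ratio), tsvbox_has_fewer_ewcs(n, ratio))
```

For the 2x1 ratio the TSVBOX has fewer EWCs at all three clock settings, while for 4x1 and 16x1 it does so only for n = 0.5 and n = 1.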
(a) Delay.
(b) Throughput.
Fig. 23: Average delay and throughput under real dVOPD
benchmark traffic.
E. Power comparison
Since there is no general benchmark method defined for
NoCs to measure power consumption [34], we follow the
methodology mentioned in [16, 30] to calculate the power
consumption of all power components: the intralayer interconnects,
i.e. the 2D interconnects or links between routers located
in the same layer of the 3D stack; the interlayer interconnects,
i.e. the 3D interconnects or links between routers of two
neighboring layers of the 3D stack; and the 3D routers themselves.
For both 3D and 2D interconnects, we developed a specific
methodology based on the ability of SystemC-A to measure
the currents flowing through the modeling circuits of Figs.
(6,16a) for different input data. A large sequence of random
flits is continuously fed to the 2D or 3D interconnects; at the
same time, the simulator samples the currents of the
interconnects at a fixed time interval, and a counter
continuously counts the number of samples. After all the sent
flits are received, the simulator multiplies the sum of the samples
by VDD and takes the average by dividing the calculated value
by the counter value.
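The averaging step described above amounts to the following computation. The Python sketch is illustrative: the function name is hypothetical, the samples are assumed to be supply-current values in amperes taken at a fixed interval, and the numbers in the usage example are placeholders.

```python
def average_power(current_samples, vdd):
    """Average power: VDD times the sum of current samples, divided by their count."""
    total = 0.0
    count = 0                      # plays the role of the sample counter
    for sample in current_samples:
        total += sample
        count += 1
    if count == 0:
        raise ValueError("at least one current sample is required")
    return vdd * total / count

if __name__ == "__main__":
    # Placeholder samples (amperes) and supply voltage (volts).
    print(average_power([1.0e-3, 3.0e-3, 2.0e-3], vdd=1.0))
```

Because the samples are equally spaced in time, this average equals VDD times the mean supply current over the observation window.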
Fig. 24: Performance comparison between TSVBOX and Serialization techniques for different TSV Multiplexing/Serialization ratios. Panels: (a), (d) 2x1 Multiplexing/Serialization; (b), (e) 4x1 Multiplexing/Serialization; (c), (f) 16x1 Multiplexing/Serialization.
Similar to the simulation scenario performed in SystemC-A, we developed a model to measure the power consumption of the 3D router as flits pass through it. First, the 3D router is written at the Register Transfer Level (RTL) in Verilog. Then, a switching-activity scenario is created by injecting a long random sequence of flits into the input ports of the 3D router. This scenario is run in Modelsim to generate a VCD file, which is then translated into a SAIF file. Finally, the 3D router is synthesized with Design Compiler (DC) from Synopsys, and the switching-activity scenario is applied using the SAIF file. The synthesis is performed for a 65 nm technology.
Power consumption results are shown in Fig. 25. For high TSV capacitance ($C_{TSV}$=500 fF, Figs. (25a,25b,25c)), the difference in power consumption is clearly to the disadvantage of the TSVBOX. The situation improves considerably for small TSV capacitance ($C_{TSV}$=15 fF, Figs. (25d,25e,25f)), because reducing the TSV capacitance lowers only the 3D interconnect power component, while the 2D interconnect and router power components are unaffected.
The trend in TSV fabrication technology is toward smaller dimensions and hence smaller capacitance. Based on the previous results, the TSVBOX is therefore well suited to recent TSV technologies, and its power consumption relative to the conventional case will keep improving as TSVs shrink.
Any of the various low-power coding techniques [35] that are geared towards minimizing the number of transitions, and hence power consumption, can be applied here. We expect that applying such a technique will further reduce the power consumption of the TSVBOX and bring it very close to, or even below, that of the conventional case. This is because the coupling parasitics (capacitance and inductance) in TSVBOX-based 3D NoCs would be smaller than in the conventional case, since the reduction in TSV count leaves more space between TSVs. For simplicity, however, no such coding technique is considered in our analysis; we leave it as future work.
F. Yield enhancement and cost reduction versus data bus width
and number of 3D stack layers
To evaluate the effect of the TSVBOX on die yield, the parameters of Table IV must first be specified. As explored in [15], microprocessor-like dies are usually large. Following the assumptions stated in [15], $A_{routers,cores}$ is assumed to be 100 mm² for $M$=4×4. We also assume that $A_{routers,cores}$ scales directly with the mesh size $M$, e.g. for $M$=8×8, $A_{routers,cores} = 100\,\text{mm}^2 \times \frac{8 \times 8}{4 \times 4} = 400\,\text{mm}^2$, and so on. The number of layers is selected to be in the range from 2 to 8. The TSV diameter is assumed to be 1 µm to match the ITRS trends for TSV technology.
Fig. 25: Power comparison for different TSV technologies, i.e. different TSV capacitances. Each panel plots power (W) for the Conventional, TSVBOX (2x1), TSVBOX (4x1), and TSVBOX (16x1) cases, broken down into router, 2D interconnects, 3D (S signal) interconnects, and 3D (data) interconnects. Panels: (a) Transpose ($C_{TSV}$=500 fF); (b) Uniform ($C_{TSV}$=500 fF); (c) Hotspot ($C_{TSV}$=500 fF); (d) Transpose ($C_{TSV}$=15 fF); (e) Uniform ($C_{TSV}$=15 fF); (f) Hotspot ($C_{TSV}$=15 fF).
In [1], the MUX area was 16 µm² for 180 nm technology, so in this paper we estimate the MUX area for 65 nm using the ITRS-suggested scaling factor of 0.7, applied over the three technology generations from 180 nm to 65 nm; thus the MUX area would be $16 \times (0.7)^3 \approx 5.5$ µm². Since the MUX circuit is composed of two transmission gates and some short connection wires, the transmission gate area can be assumed to be half the area of the MUX: $A_{TG} \approx 2.75$ µm². The driver area is estimated from the number of transistors in its circuit. Since the driver inverter has two transistors, we assume that it also occupies half of the MUX area, but we must account for the sizing of the S signal driver (the sizes of its PMOS and NMOS transistors). Therefore $A_{dr-S} \approx \frac{A_{MUX}}{4} \cdot (3K_{N-S})$, where $K_{P-S} = 2K_{N-S}$ to obtain equal charging and discharging currents during driver operation in 65 nm technology. The fab constants $\alpha$, $D_o$, and $f_{TSV}$ are assumed to have the same values as in [1], so $\alpha$=2, $D_o$=0.004 defects/mm², and $f_{TSV}$=10 ppm. Finally, $M$ is treated as a parameter taking the values 4×4, 8×8, and 16×16.
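The area assumptions above reduce to a few arithmetic rules, which the following Python sketch collects in one place. The function names are hypothetical, the sizing factor passed for $K_{N-S}$ is an illustrative value rather than one taken from the paper, and the sketch covers only the quantities stated in the text (it does not reproduce Table IV).

```python
def routers_cores_area_mm2(mesh_side, base_area_mm2=100.0, base_side=4):
    """Area of routers and cores, scaled linearly with the node count of an
    mesh_side x mesh_side mesh, starting from 100 mm^2 for a 4x4 mesh."""
    return base_area_mm2 * (mesh_side * mesh_side) / (base_side * base_side)


def scaled_mux_area_um2(area_180nm_um2=16.0, scale=0.7, generations=3):
    """MUX area after applying the 0.7 scaling factor once per generation
    (180 nm -> 130 nm -> 90 nm -> 65 nm is three generations)."""
    return area_180nm_um2 * scale ** generations


def transmission_gate_area_um2(mux_area_um2):
    """A MUX is assumed to consist of two transmission gates."""
    return mux_area_um2 / 2.0


def s_driver_area_um2(mux_area_um2, k_n):
    """A_dr-S ~ (A_MUX / 4) * (3 * K_N-S), using K_P-S = 2 * K_N-S."""
    return mux_area_um2 / 4.0 * (3.0 * k_n)


mux_area = scaled_mux_area_um2()
print(routers_cores_area_mm2(8))             # 8x8 mesh
print(mux_area)                              # about 5.5 um^2
print(transmission_gate_area_um2(mux_area))  # about 2.75 um^2
print(s_driver_area_um2(mux_area, k_n=4.0))  # k_n = 4 is a hypothetical sizing
```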
Since the die area is very large, the area overhead of TSV multiplexing is small, and the normalized die yield in Fig. 26 is very close to 1: the die yield does not improve but degrades only slightly. This degradation grows with the number of layers and is noticeable only for large $M$.⁶
⁶ Note that the same simulations were done for $D_{TSV}$=40 and the normalized die yield was larger than 1 for all the cases we tried, but we show only the case of $D_{TSV}$=1 as it conforms with ITRS trends.
The enhancement in TSV yield depends mainly on the number of TSVs. In Fig. 27, the TSV yield is always enhanced as long as the number of TSVs is reduced, as was proved mathematically in Subsection VIII-A. The enhancement in TSV yield grows with the number of 3D stack layers. A further boost comes from the data bus width, which determines the total number of TSVs in the TSV yield equation (Eq. 21): the fewer TSVs used in total (smaller $N_{BW}$), the greater the TSV yield. The TSV yield increases with the reduction in TSV count according to a very strong exponential relation. As shown in Fig. 27c, when the mesh size is 16×16 and $N_{BW}$=1024, there are 16×16×2×1024 TSVs between two layers, which is about half a million, and the enhancement in TSV yield is enormous: ≈86.746 million times the TSV yield without multiplexing.
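To illustrate why the dependence on TSV count is exponential, the sketch below counts the TSVs between two layers as in the text and evaluates a normalized TSV yield under an assumed model in which every TSV fails independently with probability $f_{TSV}$, so that the yield is $(1-f_{TSV})^{N_{TSV}}$. Eq. 21 is not reproduced in this section, so this model is an assumption and is not expected to reproduce the reported figures exactly; the function names and the halved TSV count used in the example are likewise hypothetical.

```python
def tsv_count_between_layers(mesh_x, mesh_y, n_bw, factor=2):
    """TSVs between two adjacent layers: mesh_x * mesh_y * factor * n_bw.
    The factor of 2 is taken from the text as given."""
    return mesh_x * mesh_y * factor * n_bw


def normalized_tsv_yield(n_tsv_conventional, n_tsv_multiplexed, f_tsv=10e-6):
    """Ratio of multiplexed to conventional TSV yield under the ASSUMED model
    yield = (1 - f_tsv) ** n_tsv (independent TSV failures)."""
    return (1.0 - f_tsv) ** (n_tsv_multiplexed - n_tsv_conventional)


n_conv = tsv_count_between_layers(16, 16, 1024)
print(n_conv)  # 524288, i.e. about half a million
# Hypothetical case: the TSV count is simply halved; control TSVs are ignored.
print(normalized_tsv_yield(n_conv, n_conv // 2))
```

Under this model the normalized yield depends only on the number of TSVs removed, which is why wider buses, larger meshes, and more layers all amplify the gain.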
As Eq. 30 indicates, the normalized overall W2W yield is affected by all the parameters affecting both die and TSV yields. In Figs. (28a,28b,28c), although the normalized die yield was less than 1, the overall yield is still enhanced thanks to the TSV yield.
Since the overall fabrication yield is improved for all the cases we tried, the fabrication cost is reduced, as Eq. 31 indicates. We do not show the cost results, however, because they are simply the reciprocals of the overall W2W yield values.
X. CONCLUSIONS
In this paper, the timing requirements of the TSVBOX-based 3D NoC are analyzed and its design methodology is presented. Using the proposed methodology, a 4×4×4 mesh-topology 3D NoC is modeled in SystemC-A to verify various aspects of the target design. Analytical expressions for the different 3D interconnect delays were derived using Elmore-delay estimations. The analytical delay models are verified using SystemC-A simulations, and the error for the different signal delays does not exceed 7.5%, indicating that the proposed models provide acceptably accurate estimates.
Thereafter, performance comparisons in terms of average delay and throughput are conducted to investigate the direct effects of TSV multiplexing on these two metrics. We show that the TSVBOX does not affect 3D NoC performance under some conditions that depend on the application traffic pattern, the clock frequency, and the required multiplexing ratio. The TSVBOX shows performance very close to the conventional case, especially for the Hotspot traffic pattern, which models a wide range of on-chip processing applications. The side effects of the TSVBOX on power consumption are also studied. We show that the TSVBOX power consumption becomes closer and closer to the conventional one as the TSV dimensions get smaller, which reflects the adaptability of the TSVBOX to the trend of TSV fabrication technology toward smaller TSV dimensions and hence smaller parasitic capacitance.
Also, we have investigated the effect of the design methodology on yield and cost. Although the TSVBOX adds extra circuitry, which might need large drivers, the effect on area was minimal, and the TSVBOX can still improve the overall yield enormously and reduce fabrication cost.
Fig. 26: Normalized die yield under different data bus widths, number of 3D stack layers, and mesh sizes. Each panel plots $Y_{die-norm}$ against the number of layers $N$ (2 to 8) for $N_{BW}$ = 64, 128, 256, 512, and 1024. Panels: (a) $M$=4×4; (b) $M$=8×8; (c) $M$=16×16.
Fig. 27: Normalized TSV yield under different data bus widths, number of 3D stack layers, and mesh sizes. Each panel plots $Y_{TSV-norm}$ against the number of layers $N$ (2 to 8) for $N_{BW}$ = 64, 128, 256, 512, and 1024. Panels: (a) $M$=4×4; (b) $M$=8×8; (c) $M$=16×16.
Fig. 28: Normalized overall W2W yield under different data bus widths, number of 3D stack layers, and mesh sizes. Each panel plots $Y_{W2W-norm}$ against the number of layers $N$ (2 to 8) for $N_{BW}$ = 64, 128, 256, 512, and 1024. Panels: (a) $M$=4×4; (b) $M$=8×8; (c) $M$=16×16.
APPENDIX A
PROOFS OF EQS. 18, 19, AND 20
For simplicity, this analysis assumes an even number of 3D stack layers ($N$ is even).
A. Performance analysis for Transpose traffic
In the Transpose traffic pattern, each node located at coordinates $(X, Y, Z)$ sends all its traffic to the node located at $(X_{SIZE}-X-1,\ Y_{SIZE}-Y-1,\ Z_{SIZE}-Z-1)$ [36]. Therefore, nodes located in layer $L_0$ and layer $L_{N-1}$⁷ send packets to each other, as do layers $L_1$ and $L_{N-2}$, etc. Accordingly, packets from each pair of layers encounter the same extra waiting cycles in the TSVBOX-based 3D NoC. The extra delay cycles for each pair can be calculated as follows. For the $(L_0, L_{N-1})$ pair, packets make $(N-1)$ hops going up from $L_0$ to $L_{N-1}$ or down from $L_{N-1}$ to $L_0$. According to Eq. 15, those packets suffer extra waiting cycles (EWCs):

$$EWCs|_{(L_0,L_{N-1})} = N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot (N-1)$$

⁷ We can consider $L_0$ as the first layer (layer #0; usually the bottom-most layer) and $L_{N-1}$ as the last layer (layer #$N-1$; usually the top-most layer) in the 3D stack.

Repeating the same for the $(L_1, L_{N-2})$ pair, we can deduce that packets traversing between these two layers suffer EWCs:

$$EWCs|_{(L_1,L_{N-2})} = N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot (N-3)$$

In general, for any pair $(L_m, L_{N-m-1})$, the EWCs are:

$$EWCs|_{(L_m,L_{N-m-1})} = N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot (N-2m-1)$$

The overall average EWCs ($EWC_{avg}$) is the sum of the EWCs of all pairs multiplied by 2 (because each layer of any pair has the same EWCs) and divided by $N-1$ ($N-1$ represents the maximum vertical hops) for the sake of averaging:

$$
\begin{aligned}
EWC_{avg-Transpose} &= 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \left[\left(\frac{N}{2}\right) \cdot N - \bigl(1 + 3 + 5 + \dots + (N-1)\bigr)\right] \\
&= 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \left[\left(\frac{N^2}{2}\right) - \frac{N}{4} \cdot (N)\right] \\
&= 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \frac{N^2}{4}
\end{aligned}
\tag{34}
$$
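The per-pair expression and the summation step in the derivation above can be checked numerically. The following Python sketch lists the EWCs of every layer pair $(L_m, L_{N-m-1})$ and confirms that the bracketed sum $\sum_{m=0}^{N/2-1}(N-2m-1)$ equals $N^2/4$ for even $N$. The function names and the sample parameter values are hypothetical; $n$ and $N_{MUX}$ are the symbols of Eq. 15 and are treated here as plain integers.

```python
def transpose_pair_ewcs(num_layers, n_pkt, n, n_mux):
    """EWCs of each layer pair (L_m, L_{N-m-1}) for m = 0 .. N/2 - 1."""
    if num_layers % 2 != 0:
        raise ValueError("the analysis assumes an even number of layers")
    per_hop = n_pkt * (n * n_mux - 1)  # EWCs added per vertical hop
    return [per_hop * (num_layers - 2 * m - 1) for m in range(num_layers // 2)]


def transpose_hop_sum(num_layers):
    """Sum of the hop counts (N - 2m - 1) over all layer pairs."""
    return sum(num_layers - 2 * m - 1 for m in range(num_layers // 2))


# The hop sum equals N^2 / 4 for every even N, as used in the last step of Eq. 34.
for layers in (2, 4, 6, 8):
    assert transpose_hop_sum(layers) == layers * layers // 4

# Hypothetical parameters: N = 4 layers, N_PKT = 1, n = 1, N_MUX = 2.
print(transpose_pair_ewcs(4, n_pkt=1, n=1, n_mux=2))
```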
B. Performance analysis for Uniform traffic
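In the Uniform traffic pattern, each node distributes its traffic uniformly to all other nodes in the network with equal probabilities. Therefore, each node sends $\frac{N_{PKT}}{N}$ packets on average to each layer, including its own layer. The fraction $\frac{N_{PKT}}{N}$ directed to the node's co-layer nodes does not suffer EWCs, while the fractions going to other layers do. As in the previous case of Transpose traffic, the EWCs value depends on the layer location and is the same for both layers of a pair, i.e. $EWCs_{L_0} = EWCs_{L_{N-1}}$, $EWCs_{L_1} = EWCs_{L_{N-2}}$, etc. The EWCs of the $(L_0, L_{N-1})$ pair is

$$
\begin{aligned}
EWCs|_{(L_0,L_{N-1})} &= \frac{N_{PKT}}{N} \cdot (n \cdot N_{MUX} - 1) \cdot \bigl(1 + 2 + 3 + \dots + (N-1)\bigr) \\
&= \frac{N_{PKT}}{N} \cdot (n \cdot N_{MUX} - 1) \cdot \left(\frac{N(N-1)}{2}\right)
\end{aligned}
$$

Doing the same for the $(L_1, L_{N-2})$ pair:

$$
\begin{aligned}
EWCs|_{(L_1,L_{N-2})} &= \frac{N_{PKT}}{N} \cdot (n \cdot N_{MUX} - 1) \cdot \bigl(1 + 1 + 2 + 3 + \dots + (N-2)\bigr) \\
&= \frac{N_{PKT}}{N} \cdot (n \cdot N_{MUX} - 1) \cdot \left(1 + \frac{(N-1)(N-2)}{2}\right)
\end{aligned}
$$

Continuing in the same way, the EWCs for the $(L_{\frac{N}{2}-1}, L_{\frac{N}{2}})$ pair is

$$
EWCs|_{(L_{\frac{N}{2}-1},L_{\frac{N}{2}})} = \frac{N_{PKT}}{N} \cdot (n \cdot N_{MUX} - 1) \cdot \left[1 + 2 + 3 + \dots + \left(\frac{N}{2} - 1\right) + \frac{\left(N - \left(\frac{N}{2} - 1\right)\right)\left(N - \frac{N}{2}\right)}{2}\right]
$$

Now the overall average EWCs ($EWCs_{avg}$) is

$$
\begin{aligned}
EWCs_{avg-Uniform} = 2 \cdot \frac{N_{PKT}}{N} \cdot (n \cdot N_{MUX} - 1) \cdot \Biggl[ &\left(0 + \frac{N(N-1)}{2}\right) + \left(0 + 1 + \frac{(N-1)(N-2)}{2}\right) + \left(0 + 1 + 2 + \frac{(N-2)(N-3)}{2}\right) + \dots \\
&+ \left(\left(0 + 1 + 2 + 3 + \dots + \left(\frac{N}{2} - 1\right)\right) + \frac{\left(N - \left(\frac{N}{2} - 1\right)\right)\left(N - \frac{N}{2}\right)}{2}\right) \Biggr]
\end{aligned}
$$

$$
EWCs_{avg-Uniform} = 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \left[\frac{1}{N} \cdot \left(\sum_{i=0}^{\frac{N}{2}-1} \frac{(N-i) \cdot (N-i-1)}{2} + \sum_{i=0}^{\frac{N}{2}-1} i \cdot \left(\frac{N}{2} - i\right)\right)\right]
\tag{35}
$$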
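As a sanity check on Eq. 35, the sketch below compares the two closed-form sums with a direct enumeration of the vertical hop counts from each layer in the lower half of the stack to every other layer. The function names and the sample parameter values are hypothetical; the check only confirms that the summation is internally consistent, not that the underlying traffic model is accurate.

```python
def uniform_bracket(num_layers):
    """The two sums inside the brackets of Eq. 35 (without the 1/N factor)."""
    half = num_layers // 2
    first = sum((num_layers - i) * (num_layers - i - 1) // 2 for i in range(half))
    second = sum(i * (half - i) for i in range(half))
    return first + second


def hop_sum_lower_half(num_layers):
    """Total vertical hops from each layer 0 .. N/2 - 1 to every layer."""
    return sum(abs(src - dst)
               for src in range(num_layers // 2)
               for dst in range(num_layers))


def ewcs_avg_uniform(num_layers, n_pkt, n, n_mux):
    """Eq. 35: 2 * N_PKT * (n * N_MUX - 1) * bracket / N, for even N."""
    if num_layers % 2 != 0:
        raise ValueError("the analysis assumes an even number of layers")
    return 2.0 * n_pkt * (n * n_mux - 1) * uniform_bracket(num_layers) / num_layers


# The closed form matches the direct enumeration for every even N tried.
for layers in (2, 4, 6, 8):
    assert uniform_bracket(layers) == hop_sum_lower_half(layers)

# Hypothetical parameters: N = 4 layers, N_PKT = 1, n = 1, N_MUX = 2.
print(ewcs_avg_uniform(4, n_pkt=1, n=1, n_mux=2))
```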
C. Performance analysis for Hotspot traffic
In the Hotspot traffic pattern, we assume that each node sends some part of its traffic ($h \cdot N_{PKT}$, $0<h<1$) to one or more hotspot nodes, and the rest of the traffic, $(1-h) \cdot N_{PKT}$, is uniformly distributed among all other nodes in the network. In our evaluation setup, it is assumed that there is a hotspot node in each layer, which acts as a hotspot only for its co-layer nodes. Under that assumption, the hotspot traffic stays within its own layer and suffers no EWCs, so the analysis for Hotspot is the same as for Uniform, but with $N_{PKT}$ replaced by $(1-h) \cdot N_{PKT}$. Thus, $EWCs_{avg}$ for Hotspot traffic can be calculated by the following equation:

$$
EWCs_{avg-Hotspot} = 2 \cdot N_{PKT} \cdot (n \cdot N_{MUX} - 1) \cdot \left[\frac{(1-h)}{N} \cdot \left(\sum_{i=0}^{\frac{N}{2}-1} \frac{(N-i) \cdot (N-i-1)}{2} + \sum_{i=0}^{\frac{N}{2}-1} i \cdot \left(\frac{N}{2} - i\right)\right)\right]
\tag{36}
$$
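Eq. 36 can be evaluated directly once $h$ is chosen. The following Python sketch is a straightforward transcription of the formula for even $N$; the function names and the sample values, including $h = 0.5$, are hypothetical and serve only to show how the hotspot fraction scales the result.

```python
def uniform_bracket(num_layers):
    """The two sums shared by the Uniform and Hotspot expressions."""
    half = num_layers // 2
    first = sum((num_layers - i) * (num_layers - i - 1) // 2 for i in range(half))
    second = sum(i * (half - i) for i in range(half))
    return first + second


def ewcs_avg_hotspot(num_layers, n_pkt, n, n_mux, h):
    """Eq. 36: the Uniform expression with N_PKT replaced by (1 - h) * N_PKT."""
    if num_layers % 2 != 0:
        raise ValueError("the analysis assumes an even number of layers")
    if not 0.0 < h < 1.0:
        raise ValueError("h must satisfy 0 < h < 1")
    scale = (1.0 - h) / num_layers
    return 2.0 * n_pkt * (n * n_mux - 1) * scale * uniform_bracket(num_layers)


# Hypothetical parameters: N = 4 layers, N_PKT = 1, n = 1, N_MUX = 2, h = 0.5.
print(ewcs_avg_hotspot(4, n_pkt=1, n=1, n_mux=2, h=0.5))
```

Since only the fraction $(1-h)$ of the traffic crosses layers, a larger hotspot share $h$ lowers the average EWCs proportionally, which is consistent with the Hotspot pattern being the least affected by TSV multiplexing in the performance results.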
REFERENCES
[1] M. Said, F. Mehdipour, and M. El-Sayed, Improving Performance
and Fabrication Metrics of Three-Dimensional ICs by Multiplex-
ing Through-Silicon Vias, 16th Euromicro Conference on Digital
System Design (DSD), pp. 581-586, 2013.
[2] A. Papanikolaou, D. Soudris, and R. Radojcic, Three Dimen-
sional System Integration, Springer, New York, 2011.
[3] I. Loi, S. Mitra, T. Lee, S. Fujita, L. Benini, A low-overhead Fault
Tolerance Scheme for TSV-based 3D Network on Chip Links,
IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), pp. 598-602, 2008.
[4] S. Pasricha, Exploring Serial Vertical Interconnects for 3D ICs,
46th ACM/IEEE Design Automation Conference (DAC), pp.
581-586, 2009.
[5] F. Miller, T. Wild, and A. Herkersdorf, TSV-Virtualization for
Multi-Protocol-Interconnect in 3D-ICs, 15th Euromicro Confer-
ence on Digital System Design (DSD), pp. 374-381, 2012.
[6] M. Said, F. Mehdipour, and M. El-Sayed, Thermal Analysis
of Three-Dimensional ICs, Investigating The Effect of Through-
Silicon Vias and Fabrication Parameters, Electrical Design of
Advanced Packaging and Systems Symposium (EDAPS), pp.
165-168, 2013.
[7] M. Said, F. Mehdipour, N. Miyakawa, and M. El-Sayed, Keep-
Out-Zone Analysis for Three-Dimensional ICs, International
Symposium on VLSI Design, Automation and Test (VLSI-DAT),
pp. 1-4, 2014.
[8] M. Said, F. Mehdipour, K. Murakami, and M. El-Sayed, A Design Methodology for Performance Maintenance of 3D Network-on-Chip with Multiplexed Through-Silicon Vias, 3rd ACM International Workshop on Manycore Embedded Systems (MES'15), June 2015 (to appear).
[9] S. Vangal et al., An 80-Tile 1.28TFLOPS Network-on-Chip in 65
nm CMOS, Proc. IEEE Intl Solid-State Circuits Conf. (ISSCC),
pp. 98-99, 2007.
[10] BookSim 2.0 Users Guide: https://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/Resources/BookSim.
[11] N. Jiang, D.U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D.E. Shaw, J. Kim, W.J. Dally, A detailed and flexible cycle-accurate Network-on-Chip simulator, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 86-96, 2013.
[12] D.U. Becker and W.J. Dally, Allocator implementations for
network-on-chip routers, Proceedings of the Conference on High
Performance Computing Networking, Storage and Analysis, pp.
1-12, 2009.
[13] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks,
An Engineering Approach, Morgan Kaufmann, San Francisco,
2012.
[14] W.J. Dally and B. Towles, Principles and Practices of Intercon-
nection Networks, Morgan Kaufmann Publishers, 2004.
[15] Y. Chen, D. Niu, and Y. Xie, Cost-effective integration of
three dimensional(3D) ICs emphasizing testing cost analysis,
IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), pp. 471-476, 2010.
[16] B.S. Feero, P.P. Pande, Networks-on-Chip in a Three-
Dimensional Environment: A Performance Evaluation, IEEE
Transactions on Computers, vol. 58, pp. 32-45, 2008.
[17] R. Jagtap, A Methodology for Early Exploration of TSV Inter-
connects in 3D Stacked ICs, Master thesis, TU Delft, 2011.
[18] G. Katti, M. Stucchi, K. De Meyer, and W. Dehaene, Electrical
modeling and characterization of through silicon via for three-
dimensional ics, IEEE Transactions on Electron Devices, vol. 57,
pp. 256-262, 2010.
[19] I. Savidis and E. Friedman, Closed-form expressions of 3-d via
resistance, inductance, and capacitance, IEEE Transactions on
Electron Devices, vol. 56, pp. 1873-1881, 2009.
[20] R. Weerasekera, M. Grange, D. Pamunuwa, H. Tenhunen, and
L.-R. Zheng, Compact modelling of through-silicon vias (tsvs) in
three-dimensional (3-d) integrated circuits, in IEEE International
Conference on 3D System Integration (3DIC), pp. 1-8, 2009.
[21] C. A. Zeferino and A. A. Susin, SoCIN: A parametric and
scalable network-on-chip, Proc. 16th Symposium on Integrated
Circuits and Systems Design (SBCCI), pp. 169-175, 2003.
[22] http://www.systemc-ams.org/
[23] J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated
Circuits, A Design Perspective, 2nd ed., Prentice Hall, New
Jersey, 2003.
[24] N. Weste, D. Harris, CMOS VLSI Design, A Circuits and
Systems Perspective, 4th ed., Addison Wesley, 2011.
[25] http://www.itrs.net/reports.html
[26] K. Banerjee, A. Mehrotra, Power dissipation issues in inter-
connect performance optimization for sub-180 nm designs, IEEE
Symposium on VLSI Circuits, Digest of Technical Papers, pp.
12-15, 2002.
[27] J. Uyemura, CMOS Logic Circuit Design, Kluwer Academic
Publishers, 1999.
[28] M.P. Kennedy, M.A. Awan, and M.S. Asghar, A high frequency
”divide-by-odd number” CMOS LC injection-locked frequency
divider, Journal of Analog Integrated Circuits and Signal Pro-
cessing, vol. 77 , pp. 415-421, 2013.
[29] K. Chandrasekar, Performance Validation of Networks on Chip,
Master thesis, TU Delft, 2009.
[30] P.P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, Perfor-
mance Evaluation and Design Trade-Offs for Network on Chip
Interconnect Architectures, IEEE Transactions on Computers,
vol. 54, no. 8, pp. 1025-1040, 2005.
[31] D.R. Avresky, V. Shubranov, R. Horst, and P. Mehra, Performance Evaluation of the ServerNet® SAN under Self-Similar Traffic, 13th International and 10th Symposium on Parallel and Distributed Processing, pp. 143-147, 1999.
[32] D. E. Knuth, The Art of Computer Programming, 2nd ed.,
Addison-Wesley, 1981.
[33] H. Wang, X. Zhu, L.-S. Peh and S. Malik, Orion: A Power-
Performance Simulator for Interconnection Networks, Proc. MI-
CRO, 2002, pp. 294-395.
[34] P.T. Wolkotte, G.J.M. Smit, N. Kavaldjiev, J.E. Becker, J.
Becker, Energy Model of Networks-on-Chip and a Bus, Proc. of
IEEE International Symposium on System-on-Chip, pp. 82-85,
2005.
[35] N. Jafarzadeh, M. Palesi, A. Khademzadeh, and A. Afzali-
Kusha, Data Encoding Techniques for Reducing Energy Con-
sumption in Network-on-Chip, IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 22, pp. 675-685, 2014.
[36] S. Koohi, M. Mirza-Aghatabar, S. Hessabi, Evaluation of Traf-
fic Pattern Effect on Power Consumption in Mesh and Torus
Network-on-Chips, IEEE International Symposium on Integrated
Circuits (ISIC), pp. 512-515, 2007.
[37] Gharan, Masoud Oveis, Power and chip-area aware network-
on-chip simulation, Theses and dissertations, 2011.
[38] N. Concer, L. Bononi, M. Soulie, R.Locatelli, and L. P. Car-
loni, The Connection-Then-Credit Flow Control Protocol for
Heterogeneous Multicore Systems-on-Chip, IEEE Transactions
on Computer-Aided Design Of Integrated Circuits and Systems,
vol. 29, JUNE 2010.
[39] A. Pullini, F. Angiolini, P. Meloni, D. Atienza, S. Murali, L.
Raffo, G. De Micheli, and L. Benini, NoC Design and Imple-
mentation in 65nm Technology. IEEE International Symposium
on Networks-on-Chip (NOCS), 2007.
[40] P. K. Sahu, K. Manna, N. Shah, S. Chattopadhyay, Extending
Kernighan-Lin Partitioning Heuristic for Application Mapping
onto Network-on-Chip, Elsevier Journal of systems architecture,
2014.
[41] P. K. Sahu, S. Chattopadhyay, A Survey on Application Map-
ping Strategies for Network-on-Chip Design, Journal of Systems
Architecture, vol. 59, pp. 60-76, 2013.
[42] N. Nethercote, P.J. Stuckey, R. Becket, S. Brand, G.J. Duck, G. Tack, MiniZinc: Towards a standard CP modelling language, Principles and Practice of Constraint Programming CP 2007, Lecture Notes in Computer Science, vol. 4741, pp. 529-543, Springer (2007).
[43] http://www.minizinc.org/
