3D integration is an emerging technology that overcomes the limitations of the 2D integration process. The use of short Through-Silicon Vias (TSVs) significantly reduces routing area, power consumption, and delay. However, several challenges in 3D integration technology still need to be addressed. The literature shows that reducing the TSV count considerably improves yield. The TSV multiplexing technique called TSVBOX was introduced in [1] to reduce the TSV count without affecting the direct benefits of TSVs. The TSVBOX introduces some delay to the multiplexed signals. In this paper, we analyse the TSVBOX timing requirements and derive a design methodology for TSVBOX-based 3D Network-on-Chip (NoC) that overcomes the TSVBOX speed degradation. Performance comparisons under different traffic patterns are conducted to verify our solution. We show that the performance of the TSVBOX-based 3D NoC depends strongly on the NoC traffic pattern and that, in most of the simulation scenarios we tried, it is almost the same as that of the conventional 3D NoC.
INTRODUCTION
2D integration has many limitations with respect to the needs of today's large systems. 3D integration, on the other hand, is an emerging technology that can mitigate these issues. However, some challenges still need more attention before this promising technology becomes mature and reliable. One of these challenges is reliability in terms of yield. 3D-ICs show very low yield due to the extra fabrication steps of bonding dies or wafers to each other to create the 3D stack. These extra steps may result in some faulty TSVs [2]. The probability of having faulty TSVs increases as the total number of TSVs increases. Hence, finding a technique that reduces the TSV count without affecting the benefits gained by 3D integration is crucial. The TSV multiplexing technique introduced in [1] halves the number of TSVs by multiplexing every two 3D signals into one signal and passing this signal through one TSV instead of two. Therefore, an almost 50% reduction in TSV count is achieved. Owing to this reduction, the yield analysis in [1] revealed a significant improvement over conventional 3D-ICs. The TSVBOX uses an extra selection signal (SEL) to control a multiplexer/demultiplexer assembly. This SEL signal adds some delay to one of the multiplexed signals, on top of the delay caused by the parasitics of the TSVBOX itself.
Such delay may affect the functional validity of a system implemented using the TSVBOX. Although [1] addresses most of the issues related to TSV multiplexing, this delay problem has not been studied yet. Owing to its scalability and novelty as a multicore communication architecture for future multiprocessor systems-on-chip (MPSoCs), the 3D NoC is selected as our target system architecture for applying the TSVBOX technique. Using this architecture, the timing requirements and the design methodology for the TSVBOX-based 3D NoC are investigated so that performance degradation is avoided. The rest of this paper is organized as follows: Section 2 introduces the TSVBOX technique. Section 3 describes the target 3D NoC architecture. Section 4 presents the TSVBOX circuit model and various design aspects. Section 5 introduces the TSVBOX timing analysis and our design methodology, which tackles the delay degradation of the TSVBOX. The simulation results are discussed and justified in Section 6. Finally, Section 7 concludes the paper.
TSVBOX
As shown in Fig. 1a, the TSVBOX is composed of a multiplexer (MUX), a demultiplexer (DeMUX), and a TSV in between. The selection signal (SEL), shown in Fig. 1b, controls the circuit. According to [1], the TSVBOX circuit allows the signals $V_1$ and $V_2$ to be transferred over one TSV instead of two as in conventional 3D-ICs. Ideally, $V_2$ passes with no delay, so $V_2' = V_2$, and $V_1$ is delayed by $T_H$, so $V_1'(t) = V_1(t - T_H)$. Therefore, $T_H$ should be selected such that $T_H < T_{CLK}$ in general. The proof of this concept and a more detailed analysis are given in [1].
THE TARGET 3D NOC ARCHITECTURE
Fig. 2 shows our target 3×3×2 mesh-topology 3D NoC architecture. The packet size $N$ is assumed to be equal to the data bus width. For the conventional 3D NoC, the whole 3D data bus width is $N + 2$, where the two extra bits are required for the synchronous handshaking protocol, which needs request (REQ) and acknowledgement (ACK) signals. For the TSVBOX case, the data bits of the packet are multiplexed, and hence $N/2 + 2$ TSVs are required. However, each vertical bus needs two extra TSVs to transfer the SEL signal and its inverted version $\overline{SEL}$. Therefore, the vertical connection bus width is $N/2 + 4$ for the TSVBOX-based 3D NoC. For simplicity, Store-And-Forward and XYZ deterministic routing are chosen as the 3D NoC switching technique and routing algorithm, respectively. Each router is a 3D router with six ports (North, South, East, West, Local, and Vertical). Each port has one FIFO buffer with a capacity of five packets, where the packet size is 32 bits. Each router is connected through its local port to a processor core, which is modeled as a random packet generator/consumer for the router.
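The vertical bus widths described for the target architecture can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are hypothetical and only restate the $N + 2$ and $N/2 + 4$ counts given in the text.

```python
def conventional_tsv_count(n_bits: int) -> int:
    """TSVs per vertical link in the conventional 3D NoC: N data + REQ + ACK."""
    return n_bits + 2


def tsvbox_tsv_count(n_bits: int) -> int:
    """TSVs per vertical link with TSVBOX: N/2 data + REQ + ACK + SEL + inverted SEL."""
    if n_bits % 2 != 0:
        raise ValueError("the data bus width must be even to pair signals")
    return n_bits // 2 + 4


N = 32  # packet size in bits, as stated in the text
conventional = conventional_tsv_count(N)   # 34
multiplexed = tsvbox_tsv_count(N)          # 20
saving = 1.0 - multiplexed / conventional  # fraction of TSVs removed per link
print(conventional, multiplexed, round(saving, 3))
```

For the 32-bit packets used here, the per-link saving is somewhat below 50% because the handshake and selection TSVs are not multiplexed.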

MODEL SETUP
Since SPICE models for such a system would be too complicated and time consuming in both design and simulation, SystemC-A is used for our 3D NoC implementation. Processor cores, routers, and intra-layer interconnects are modeled as high-level microarchitectures, while the inter-layer interconnects (the vertical connections), i.e. the TSVs and the TSVBOX, use a low-level circuit implementation so that the different delays can be determined accurately. The design considerations are as follows:
• 3D path model: As shown in Fig. 3a, the 3D signal is assumed to pass through an input inverter driver, a global wiring segment in the first layer, a TSV, a global wiring segment in the second layer, and an output inverter driver. The output inverter driver is always assumed to be a 1x-inverter (minimum-size inverter).
• TSV and wiring parasitics: For the TSV and wiring circuit models, the models introduced in [4, 5] are used (Fig. 3b and 3c, respectively). According to [6], $C_{TSV}$ and $R_{TSV}$ can be assumed to be 15 fF and 1 Ω, respectively. Wiring is assumed to be global. The length of the global wires is assumed to be 200 µm [6]. Global wiring parasitics are given in Table 1 [7].
• Transistor model: The circuit models for the nMOS and pMOS transistors are shown in Fig. 3d and 3e, respectively. The model parasitics and technology parameters are listed in Table 1, based on the 180 nm TSMC transistor model [3].
• Transmission gate transistor design: The TSVBOX is composed of four transmission gates, and according to [8], transmission gate transistors are usually minimum sized ($K_N = K_P = 1$, where $K_P$ and $K_N$ are the sizes of the pMOS and nMOS, respectively). Thus, both the nMOS and pMOS of the transmission gate are selected such that $W_P = L_P = W_N = L_N$, where $W_P$, $W_N$ and $L_P$, $L_N$ are the widths and lengths of the pMOS and nMOS, respectively.
• Load capacitance: Since the load of the inverter driver is assumed to be a 1x-inverter, its input capacitance can be expressed as:
where $W_P = 1.5L_P$ for the pMOS to ensure equal currents during the charging and discharging phases.
• ON-OFF threshold voltages of inverter drivers: There are two input thresholds. The first, $V_{inL-max}$, is the maximum low input voltage at which the pMOS is ON and the nMOS is OFF at the same time; therefore, if $V_{in} \le V_{inL-max}$, the nMOS is OFF and the pMOS is ON. The second, $V_{inH-min}$, is the minimum high input voltage at which the nMOS is ON and the pMOS is OFF at the same time; therefore, if $V_{in} \ge V_{inH-min}$, the nMOS is ON and the pMOS is OFF. For digital circuits, $V_{inL-max}$ and $V_{inH-min}$ can be calculated as follows:
where $|V_{thP}|$, $V_{thN}$, and $V_{DD}$ are listed in Table 1.
• ON-OFF threshold voltages of transmission gate switches: The nMOS and pMOS switches turn ON when $V_{gN} \ge V_{thN}$ and $|V_{gP}| \ge |V_{thP}|$, respectively. Therefore, $\max(V_{thN}, |V_{thP}|)$ is enough to meet both conditions. Following the same analysis, the OFF condition is $\min(V_{thN}, |V_{thP}|)$. Since $V_{thN} \approx |V_{thP}|$ in most technologies, for simplicity we can assume that $\max(V_{thN}, |V_{thP}|) \approx \min(V_{thN}, |V_{thP}|)$.
Conventional 3D signal path modeling
The 3D signal path of the conventional 3D-IC is shown in Fig. 4. The signal is assumed to pass through an inverter driver (represented by its ON resistance $R_{dr-Conv}$), the global wiring of the first layer, the TSV, the global wiring of the second layer, and the load capacitance, which is the input gate capacitance of a 1x-inverter driver in the second layer. The 3D signal path delay of the conventional 3D NoC ($T_{d-Conv}$) can be approximated using the Elmore delay as follows:
where the wiring and TSV resistances are neglected to simplify the analysis because of their very small values.
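To illustrate why neglecting the TSV resistance is reasonable, the sketch below evaluates the generic Elmore delay of an RC ladder with and without $R_{TSV}$. It is not the paper's exact expression: the ladder topology, the driver resistance, the wire capacitance, and the load capacitance are hypothetical placeholder values, and only $C_{TSV}$ = 15 fF and $R_{TSV}$ = 1 Ω come from the text.

```python
def elmore_delay(stages):
    """Elmore delay of an RC ladder.

    stages: list of (series_resistance_ohm, shunt_capacitance_farad),
    ordered from the driver to the load. Each capacitance is charged
    through the sum of all resistances upstream of it.
    """
    delay = 0.0
    upstream_resistance = 0.0
    for resistance, capacitance in stages:
        upstream_resistance += resistance
        delay += upstream_resistance * capacitance
    return delay


C_TSV = 15e-15    # from the text: 15 fF
R_TSV = 1.0       # from the text: 1 ohm
R_DRIVER = 10e3   # hypothetical driver ON resistance
C_WIRE = 40e-15   # hypothetical capacitance of one 200 um global wire
C_LOAD = 2e-15    # hypothetical 1x-inverter input capacitance

# Wiring and TSV resistances neglected: the path is one R driving a lumped C.
simplified = elmore_delay(
    [(R_DRIVER, C_WIRE), (0.0, C_TSV), (0.0, C_WIRE), (0.0, C_LOAD)]
)
# Same path with the 1 ohm TSV resistance included.
with_r_tsv = elmore_delay(
    [(R_DRIVER, C_WIRE), (R_TSV, C_TSV), (0.0, C_WIRE), (0.0, C_LOAD)]
)
relative_change = (with_r_tsv - simplified) / simplified
print(simplified, with_r_tsv, relative_change)
```

With these placeholder values the TSV resistance changes the delay by well under 0.1%, which is consistent with dropping it from the analysis.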
TSVBOX 3D signal path modeling
Fig. 5 shows the TSVBOX circuit model. It is similar to the circuit of the conventional 3D signal path; the difference is that the equivalent RC parasitic circuits of the transistors in the MUX and DeMUX are included. Since there are no transistor models in SystemC-A, the switching behavior of the transmission-gate transistors is modeled using perfect switches. The SEL signal controls the upper transmission gates of the MUX and the DeMUX, and its inverted version $\overline{SEL}$ controls the lower ones. Therefore, the lower and upper transmission gates switch ON and OFF exclusively, as required in the original TSVBOX design. The SEL signal path shown in Fig. 5 is similar to the conventional 3D signal path. However, the SEL signal drives the gates of the transmission gates. Therefore, for each SEL path, four gate capacitances ($4C_g$) are involved in the load: $2C_g$ in the MUX and $2C_g$ in the DeMUX. Then, for a data bus width of $N/2$, the TSVBOXes contribute $N \cdot (C_{gN} + C_{gP})$ in total to the SEL signal load. The Elmore delay for both $V_1$ and $V_2$ can be approximated according to the following equation:
where the ON resistance of each transmission gate is the parallel combination $R_{onP} R_{onN} / (R_{onP} + R_{onN})$. The delay of the SEL signal can be approximated by:
5. TIMING REQUIREMENTS ANALYSIS
Synchronous communication protocol
Fig. 6 shows the synchronous handshaking protocol between two routers, which works as follows:
(1) The transmitting router initiates a request to the receiving router by setting the REQ signal to '1'. At the same time it puts a data packet (PKT) on the data bus.
(2) After at least one cycle, the initiated request will be recognized by the input controller of the receiving router. If there is at least one free slot in the FIFO buffer, the packet will be read and the acknowledgement ACK signal will be set to '1', announcing successful reception of the packet.
(3) Upon detecting the ACK signal, after one cycle, the transmitting router will reset the REQ to '0'. Resetting REQ is the end of the communication procedure.
According to the above procedure, the packet must be ready at least one clock cycle after the request is initiated. Therefore, the SEL signal can be chosen to be the clock signal itself. In that case, one of the two multiplexed signals ($V_1$ or $V_2$) is transmitted in the first half of the clock cycle and the other in the second half. Consequently, $T_H = T_L = T_{CLK}/2$.
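The three handshake steps above can be sketched as a cycle-by-cycle model. This is an illustrative sketch only: the function name, the trace format, and the `max_cycles` limit are hypothetical, and the behavior of ACK after REQ is reset is left out because the text does not specify it.

```python
from collections import deque


def transfer_packet(packet, fifo, capacity, max_cycles=20):
    """Simulate one REQ/ACK transfer; returns a list of (cycle, REQ, ACK)."""
    req, ack = 0, 0
    bus = None
    trace = []
    for cycle in range(max_cycles):
        if cycle == 0:
            # Step 1: the transmitter raises REQ and drives the packet.
            req, bus = 1, packet
        elif req == 1 and ack == 0:
            # Step 2: at least one cycle later the receiver sees REQ and
            # accepts the packet only if the FIFO has a free slot.
            if len(fifo) < capacity:
                fifo.append(bus)
                ack = 1
        elif req == 1 and ack == 1:
            # Step 3: one cycle after ACK, the transmitter resets REQ,
            # which ends the communication procedure.
            req = 0
            trace.append((cycle, req, ack))
            return trace
        trace.append((cycle, req, ack))
    return trace  # the FIFO stayed full: REQ is still pending


fifo = deque()
trace = transfer_packet(0xDEADBEEF, fifo, capacity=5)
print(trace)
```

When the receiving FIFO is full, the model keeps REQ asserted and never raises ACK, which mirrors the back-pressure implied by step (2).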
Power gating to SEL signal
It can be seen from Fig. 6 that the SEL signal is needed only during the packet transfer operation. To decrease the power consumption overhead caused by the SEL signal, power gating is applied to it using the REQ signal, as shown in Fig. 7. The power-gating transmission gate introduces some extra delay into the SEL path. To reduce this delay, the transmission gate is positioned before the input inverter driver of the SEL signal. To reduce it further, a 1x-inverter driver is inserted between the power-gating transmission gate and the main inverter driver of the SEL signal. Therefore, only the input capacitance of the 1x-inverter driver is considered in calculating this extra delay. The power-gating extra delay can be approximated as:
where $K_N = 1$ and $K_P = 1.5$ are the sizes of the nMOS and pMOS transistors of the 1x-inverter driver, respectively.
Avoiding concurrent ON state of TSVBOX switches
As shown in Fig. 8a, at $t = 0.5T_{CLK}$, SEL and $\overline{SEL}$ start discharging and charging, respectively. However, $\overline{SEL}$ reaches the ON threshold $\max(V_{thN}, |V_{thP}|)$ at $t = T_1$, while SEL reaches the OFF threshold $\min(V_{thN}, |V_{thP}|)$ only at $t = T_2$. This is because $\overline{SEL}$ starts from '0' and SEL starts from $V_{DD}$, and '0' is much closer to $\max(V_{thN}, |V_{thP}|)$ than $V_{DD}$ is to $\min(V_{thN}, |V_{thP}|)$. Consequently, during $T_1 \le t \le T_2$, all the transmission gate switches of the TSVBOX are ON, which violates the theoretical behavior of the TSVBOX. To avoid this problem, SEL should discharge fast enough to reach $\min(V_{thN}, |V_{thP}|)$ at $t = T_1$, as shown in Fig. 8b. This means that the driver resistance of the SEL signal should be smaller than the driver resistance of $\overline{SEL}$.
Minimum clock period for TSVBOX
As shown in Fig. 8, the interval $0.5T_{CLK} \le t \le T_{CLK}$ can be divided into two shorter periods: $T_{d-SEL}$ and the remaining time until the clock edge, $T_{rem}$. During $T_{rem}$, the $V_2$ signal must reach an acceptable '0' or '1' level; therefore, we must select $T_{rem} \ge T_{d-TSVBOX}$. Based on these observations, the minimum clock period for the TSVBOX can be expressed as follows:
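Combining the two observations gives a lower bound on the clock period. The derivation below is a sketch reconstructed only from the definitions in this subsection (it ignores the power-gating delay), so its exact form is an assumption rather than a quotation of the original expression.

```latex
\begin{align}
  \tfrac{1}{2}\,T_{CLK} &= T_{d-SEL} + T_{rem},
  \qquad T_{rem} \ge T_{d-TSVBOX} \\
  \Rightarrow\quad T_{CLK} &\ge 2\,\bigl(T_{d-SEL} + T_{d-TSVBOX}\bigr)
\end{align}
```

The factor of two appears because SEL is the clock itself, so each multiplexed signal has only half a clock period in which to settle.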
5.5 Different design flows for 3D interconnect
Fig. 9 introduces two design flows for elaborating the 3D interconnects of the conventional and TSVBOX-based 3D NoCs. As shown, the first step in both design flows is calculating the technology-dependent constants and parameters, such as RC parasitics and threshold voltages. All other steps in both design flows are concerned with the design of the drivers. The driver resistance can be taken as the average value of $R_{onP}$ and $R_{onN}$:
where $K_P \ge 1$ and $K_N \ge 1$ are the sizing factors of the pMOS and nMOS transistors, respectively. For all drivers, we choose $K_P = 1.5K_N$ to achieve equal currents during charging and discharging. Consequently, Eq. 8 can be rewritten as:
For the conventional 3D NoC, $T_{d-Conv}$ is first selected such that $T_{d-Conv} \le T_{CLK}$. Then $R_{dr-Conv}$ is calculated using Eq. 3. However, the driver cannot be smaller than minimum size (minimum width and length for each transistor); therefore, if the calculated value of $R_{dr-Conv}$ is larger than the maximum driver resistance $R_{dr-max}$ (the resistance at minimum size), $R_{dr-Conv}$ must be set to $R_{dr-max}$. After that, $K_{N-Conv}$ is determined from Eq. 9 based on the value of $R_{dr-Conv}$. Similar steps can be followed for the TSVBOX to calculate its delays and the corresponding driver sizes.
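The driver-sizing steps of the conventional flow can be sketched as follows. Several assumptions are made that are not confirmed by the text: the ON resistance of each transistor is taken to scale inversely with its sizing factor, the delay is taken as the driver resistance times a lumped path capacitance, and the unit resistances and capacitance are hypothetical placeholder values.

```python
def driver_resistance(k_n, r_n_unit, r_p_unit):
    """Average of the pMOS and nMOS ON resistances, with K_P = 1.5 * K_N.

    Assumes R_on = R_unit / K for each transistor (an assumption).
    """
    k_p = 1.5 * k_n
    return 0.5 * (r_p_unit / k_p + r_n_unit / k_n)


def size_driver(t_d_target, c_total, r_n_unit, r_p_unit):
    """Return (R_dr, K_N, K_P) for a target delay T_d = R_dr * C_total."""
    r_required = t_d_target / c_total
    # Largest possible driver resistance: the minimum-size driver (K_N = 1).
    r_max = driver_resistance(1.0, r_n_unit, r_p_unit)
    # Clamp: a driver cannot be smaller than minimum size.
    r_dr = min(r_required, r_max)
    # Under the 1/K scaling assumption, R_dr = r_max / K_N.
    k_n = r_max / r_dr
    return r_dr, k_n, 1.5 * k_n


R_N_UNIT = 10e3    # hypothetical ON resistance of a unit nMOS
R_P_UNIT = 30e3    # hypothetical ON resistance of a unit pMOS
C_TOTAL = 100e-15  # hypothetical lumped path capacitance

relaxed = size_driver(2.5e-9, C_TOTAL, R_N_UNIT, R_P_UNIT)  # clamps to minimum size
tight = size_driver(0.5e-9, C_TOTAL, R_N_UNIT, R_P_UNIT)    # needs a larger driver
print(relaxed, tight)
```

A relaxed delay target ends at the minimum-size driver, while a tighter target scales both transistors up with the 1.5 ratio kept fixed.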
Mitigating delay degradation
Referring to Eqs. 3 and 4, the conventional 3D NoC outperforms the TSVBOX-based one. What makes the situation worse is that data transfer through the TSVBOX is done serially ($V_1$ during $T_H$, then $V_2$ during $T_L$). Using the GALS (Globally Asynchronous Locally Synchronous) concept, this performance degradation can be mitigated if the vertical communication through the 3D interconnect runs at a different clock frequency from the other modules of the 3D NoC. According to [9, 10], with all communication between routers being synchronous, modules running with different clocks can communicate correctly with each other using the REQ and ACK signals. Since TSVs are usually shorter than other interconnects, the critical path delay ($CP$) is always expected to be larger than the TSV delay. As shown in Fig. 10
SIMULATION RESULTS
3D signal path delays
Two clock periods are selected that meet the requirements introduced in Subsection 5.6: $T_{CLK-3D}$ is set to 2.5 ns and 9.3 ns for the conventional and TSVBOX-based 3D NoCs, respectively. These values allow us to choose data drivers of minimum size and to make changes only in the SEL ($\overline{SEL}$) signal driver, which simplifies the design process. The other router modules run at $T_{CLK} = CP$, where $CP$ is treated as a hypothetical parameter taking the values $T_{d-Conv}$, $3T_{d-Conv}$, and $5T_{d-Conv}$ (2.5, 7.5, and 12.5 ns, respectively). Based on $T_{CLK-3D}$, all conventional and TSVBOX delays and their associated driver sizes are determined according to the design flows presented in Subsection 5.5 and are listed in Tables 2 and 3. As shown, the error between the theoretical and simulated delays does not exceed 14%, which indicates the acceptable accuracy of the Elmore-delay model. Also, both $T_{d-PG-SEL}$ and $T_{d-PG-\overline{SEL}}$ are found to be negligible (0.075 and 0.06302 ns, respectively) and do not affect our proposed design methodology.
Performance comparison
To test the performance under different scenarios, two traffic patterns are selected: Matrix Transpose (MT) and HotSpot (HS). In MT, each node located at $(X, Y, Z)$ sends all its traffic to the node located at $(S_X - 1 - X, S_Y - 1 - Y, S_Z - 1 - Z)$, where $(S_X = 3, S_Y = 3, S_Z = 2)$ is the 3D NoC size. In HS, we assume there is a hotspot node in each layer, to which 50% of the total packets injected in that layer are destined. Therefore, the two hotspots consume on average 50% of the total traffic of the 3D NoC. The packet injection process of each traffic flow is a Poisson random process, in which the random time intervals between successive injections follow an exponential distribution [11]. A Poisson-distributed injection rate is adopted because it successfully characterizes the performance of multiprocessor applications [12]. Two performance metrics are studied: average delay and throughput. The delay of a packet is defined as the total number of cycles the packet takes to cross the network to its destination node. It spans from the creation of the packet to its ejection at the destination, including the source buffer queuing time. The average throughput is defined as the average ejection rate of the packets at their destination nodes. The simulation warm-up period is set to 1000 cycles, during which no results are calculated, until the network becomes congested [13]. The simulation runs with 3600 packets: each node injects 200 packets, and the simulation continues at the prescribed packet injection rate until all these packets have been received; their average delay and throughput are then calculated. Under MT traffic, 100% of the injected packets cross the 3D interconnect, while this percentage drops to ≈ 25% in the HS case. Fig. 11 shows the performance comparison between the TSVBOX-based and conventional 3D NoCs. It is clear from Figs. 11(a,b,c,d,g,h,i,j) that the performance degradation in HS is less than the degradation in MT. As shown in Figs. 11(e,f,k,l), for both traffic patterns, as the ratio of CP to the conventional 3D interconnect delay becomes larger, the TSVBOX-based 3D NoC performs very close to the conventional 3D NoC, because a larger CP masks the delay degradation of the TSVBOX and reduces its effect on the overall performance. The case of Figs. 11(g,h) is the worst among all cases: the CP ratio is 1 and, since the traffic is MT, all injected packets suffer from the delay degradation of the TSVBOX.
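The MT destination rule and the Poisson injection process can be sketched as follows. The function names, the injection rate of 0.1 packets per cycle, and the random seed are hypothetical choices for illustration; only the mapping formula, the 3×3×2 size, and the 200 packets per node come from the text.

```python
import random

SIZE = (3, 3, 2)  # (S_X, S_Y, S_Z) from the text


def mt_destination(node, size=SIZE):
    """Matrix Transpose destination: (S_X-1-X, S_Y-1-Y, S_Z-1-Z)."""
    return tuple(s - 1 - c for c, s in zip(node, size))


def injection_times(rate, n_packets, rng):
    """Poisson process: exponential gaps between successive injections."""
    t = 0.0
    times = []
    for _ in range(n_packets):
        t += rng.expovariate(rate)  # mean gap is 1 / rate
        times.append(t)
    return times


nodes = [(x, y, z) for x in range(3) for y in range(3) for z in range(2)]
# With S_Z = 2 the Z coordinate always flips, so every MT packet
# has to cross the vertical (3D) interconnect.
crossing = sum(1 for n in nodes if mt_destination(n)[2] != n[2])

rng = random.Random(0)                                      # hypothetical seed
times = injection_times(rate=0.1, n_packets=200, rng=rng)   # hypothetical rate
total_packets = len(nodes) * 200
print(crossing, len(nodes), total_packets)
```

The count confirms the statement that all MT traffic crosses layers, and 18 nodes injecting 200 packets each gives the 3600 packets used in the simulation.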
CONCLUSIONS
In this paper, the design methodology of the TSVBOX-based 3D NoC is presented in detail. Using the proposed methodology, a 3×3×2 mesh-topology 3D NoC is implemented in SystemC-A to verify various aspects of the target design. Simulations were then conducted to compare the conventional and TSVBOX-based 3D NoCs in terms of delay and throughput. The TSVBOX-based 3D NoC shows almost no performance degradation relative to the conventional 3D NoC, especially when the main clock period is larger than the conventional 3D interconnect delay, and also when fewer packets traverse between the 3D stack layers, as in the HotSpot traffic case.
