Abstract-This paper proposes a novel technique to exploit the high bandwidth offered by through-silicon vias (TSVs). In the proposed approach, synchronous parallel 3D links are replaced by serialized links to save silicon area and increase yield. Detailed analysis conducted in 90 nm CMOS technology shows that the proposed 2-Gb/s/pin quasi-serial link requires approximately five times less area than its parallel bus equivalent at the same data rate for a TSV diameter of 20 μm.
I. INTRODUCTION
Packet-based communication has been introduced into modern NoCs as a superior concept, compared to traditional bus architectures, to overcome typical interconnect-related issues such as the synchronization problems caused by long wiring [1, 2]. Data packets are transferred between network switches throughout the on-chip network using wide parallel links. By splitting a 2D NoC into 3D tiers, data can be transmitted along with the synchronous clock from one tier to another directly through "3D" vias instead of enduring multiple hops before arriving at the destination core. These "3D" vias are also referred to as Through Silicon Vias (TSVs). However, even in an advanced 3D integration technology, a single TSV is of considerable size and pitch compared to the typical horizontal wiring (bus) width and pitch, so any attempt to use reasonable bus sizes, e.g. 32 bits, would lead to a large area cost. The expected variability of the 3D process and the need for synchronization would further exacerbate the problem. Serializing the wide parallel data and transmitting the data stream through a limited number of TSVs seems to be a promising solution to the area and synchronization problems because of the large bandwidth (Gb/s range) offered by TSVs, which, to our knowledge, has not yet been exploited.
The paper is organized as follows: In Section II, we introduce a new concept for 3D link design and describe a possible implementation of the link through two case studies. In Section III, we describe a realistic TSV process that can be utilized to implement such links. We also analyze the electrical performance of TSVs based on existing studies of TSV models. In Section IV, we study how these links would be implemented and how they would perform at higher clock frequencies. Finally, we conclude our work in Section V.
II. 3D INTER-TIER PHYSICAL LINKS
Over the years, multiple types of links have been proposed for on-chip communication [3, 4]. One of them, known as the "synchronous parallel link", transmits data-words of fixed size along with the clock required for synchronization. Unfortunately, apart from a very high area demand for large data-words, this type of link increasingly suffers from synchronization issues as the size of the data-word increases. To address the latter problem, the "asynchronous parallel link" has been proposed [4]. In the asynchronous case, even though there is no need for clock transmission, the additional circuitry required for synchronization may impair the overall system performance. In both cases, the large number of TSVs required imposes a significant penalty on the overall area, since neither exploits the large bandwidth offered by TSVs as proposed in this work. We start with a case study of a 2-Gb/s serial link, which is composed of an 8:1 multiplexer (MUX), a 1:8 demultiplexer (DEMUX) and TSVs. Using it as the building block, we propose a new link concept, referred to as the "3D quasi-serial physical link". In both cases, we exploit the large transistor density offered by modern sub-100nm CMOS processes. Since most, if not all, TSV fabrication technologies do not scale with the transistors, we expect the benefits to grow further as designs move towards smaller CMOS processes. Please note that the bandwidth capacity and the feasibility of high-speed TSV links will be shown in Section III.
A. Case study I: 2-Gb/s serial link
The proposed architecture is shown in Fig. 1. The serial link operates at 2 Gb/s and is based on existing designs in [6] (3-Gb/s examples can be found in [7]). The 8:1 MUX schematic is depicted in Fig. 2, as proposed in [6]. The circuit generates a high-level pulse of duration 1/fout (fout: the MUX's output frequency). The pulse is transmitted through eight shift-registers (labeled S1 to S8) synchronized with the 2 GHz bit clock. Data latched in the load registers (labeled L1 to L8) are extracted and sampled by a retiming flop. The circuit has been implemented in 90 nm CMOS using basic components from a Faraday commercial standard cell library. The back-end phase has been carried out with SoC Encounter 7.1 by Cadence®. The entire back-end, including placement, clock tree design, and routing, has been performed. Finally, a sign-off parasitic extractor has been used to verify the circuit timing after layout. The propagation delay along the critical path in the worst-case PVT corner is 588 ps, corresponding to a 1.7 GHz maximum operating frequency. This requires slowing down the NoC clock to 212.5 MHz. In the typical corner, the circuit has a critical path of 500 ps, which complies with the 2 GHz specification. The total area is 22 μm by 20 μm (440 μm2).
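The relation between the critical-path delay, the maximum bit clock and the resulting NoC clock can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and it assumes that the NoC (word) clock is simply the bit clock divided by the 8:1 serialization ratio.

```python
def max_frequency_ghz(critical_path_ps: float) -> float:
    """Highest clock frequency (GHz) whose period still covers the critical path (ps)."""
    return 1000.0 / critical_path_ps


def noc_clock_mhz(bit_clock_ghz: float, mux_ratio: int = 8) -> float:
    """Word (NoC) clock in MHz when mux_ratio bits are serialized per word."""
    return bit_clock_ghz * 1000.0 / mux_ratio


# Worst-case and typical corners quoted in the text.
print(round(max_frequency_ghz(588.0), 2))  # about 1.7 GHz
print(max_frequency_ghz(500.0))            # 2 GHz
print(noc_clock_mhz(1.7))                  # NoC clock at the 1.7 GHz limit
print(noc_clock_mhz(2.0))                  # NoC clock at the 2 GHz target
```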
1) Circuit design for each tier
The clock division is implemented by two latches, arranged in a master-slave configuration. This building block performs a division by two, while further division may be obtained by connecting by-two dividers in series. Clock signals of 1 GHz, 500 MHz and 250 MHz are all produced in this way. The 300 ps skew existing between the 2 GHz and 250 MHz clocks does not compromise the MUX's functionality. The area of a by-two divider is 21.2 μm2, and the circuit can operate at the working frequency in the slow corner.
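The cascading of by-two dividers can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions, not the latch-level circuit: each divider is idealized as a toggle element that flips its output on every rising edge of its input, and the clock is represented as a list of 0/1 samples. The function names are hypothetical.

```python
def divide_by_two(clock):
    """Idealized by-two divider: toggle the output on each rising input edge."""
    out, state, prev = [], 0, 0
    for level in clock:
        if prev == 0 and level == 1:  # rising edge detected
            state ^= 1
        out.append(state)
        prev = level
    return out


def count_rising_edges(clock):
    """Number of 0 -> 1 transitions, assuming the signal starts low."""
    prev, edges = 0, 0
    for level in clock:
        if prev == 0 and level == 1:
            edges += 1
        prev = level
    return edges


# Sixteen samples of the fastest clock (stands in for the 2 GHz bit clock).
bit_clock = [0, 1] * 8
div2 = divide_by_two(bit_clock)  # stands in for 1 GHz
div4 = divide_by_two(div2)       # stands in for 500 MHz
div8 = divide_by_two(div4)       # stands in for 250 MHz

for name, clk in [("in", bit_clock), ("/2", div2), ("/4", div4), ("/8", div8)]:
    print(name, count_rising_edges(clk))
```

Each stage halves the number of rising edges in the same time window, which is the behavior expected from three dividers connected in series.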
The 1:8 DEMUX is designed based on the circuit proposed in [7]. The circuit schematic is depicted in Fig. 3. It consists of a tree of 1:2 DEMUXes, and each tree level operates at a different frequency. Fig. 3 also shows the circuit diagram of the basic building block, a 1:2 DEMUX, which is formed by a rising-edge triggered flop and a falling-edge triggered flop followed by a rising-edge triggered flop; the latter pair samples on the falling edge and outputs data on the rising edge. The DEMUX circuit has been implemented in 90 nm technology. The design could demultiplex 4-bit data at 8 Gb/s, synchronized with a 1 GHz clock in the slow corner. The layout area is 10 μm by 44.32 μm (443.2 μm2).
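The tree organization of the DEMUX can be sketched behaviorally. The model below is an assumption-laden illustration rather than the circuit of [7]: a 1:2 DEMUX is idealized as splitting a bit stream into its even and odd samples, and the 8:1 MUX as emitting the bits of each word in order. With this particular recursion the output lanes come out in bit-reversed order, so the sketch reorders them; all names are hypothetical.

```python
def serialize(words):
    """Idealized 8:1 MUX: emit the bits of each word one after another."""
    return [bit for word in words for bit in word]


def demux_1to2(stream):
    """Idealized 1:2 DEMUX: even-indexed and odd-indexed samples."""
    return stream[0::2], stream[1::2]


def demux_tree(stream, levels):
    """Tree of 1:2 DEMUXes; each level runs at half the rate of the previous one."""
    if levels == 0:
        return [stream]
    even, odd = demux_1to2(stream)
    return demux_tree(even, levels - 1) + demux_tree(odd, levels - 1)


def bit_reverse(index, width):
    """Reverse the lowest `width` bits of `index`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def deserialize(stream, levels=3):
    """Recover parallel words from the serial stream (2**levels bits per word)."""
    lanes = demux_tree(stream, levels)
    ordered = [lanes[bit_reverse(pos, levels)] for pos in range(1 << levels)]
    return [list(word) for word in zip(*ordered)]


words = [[1, 0, 1, 1, 0, 0, 1, 0], [0, 1, 1, 0, 1, 0, 0, 1]]
print(deserialize(serialize(words)) == words)
```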
Considering how TSVs can be integrated into the design, we made some simplifications. First, the TSV height is limited to approximately 50 μm, much shorter than the wavelength Table III ) is much smaller than the pin capacitance of backplane interconnects, the need for power-hungry low-voltage differential signaling (LVDS) drivers is removed.
2) TSV insertion
The diameter and spacing of the TSVs used in this design are both 20 μm. TSV connections from the upper tier land on top-metal pads, while TSVs connecting to the bottom tier are fabricated from the chip backside and land on metal2 in the metal interconnection stack. In this case study, the MUX is assigned to Tier 1 and the DEMUX to Tier 2. Tier 2 is on top of Tier 1. For the MUX on Tier 1, since the TSV landing pads are placed on the top metal layer, all the layers under them can be used either for routing or for placing active devices (e.g. transistors). In this implementation we have placed the MUX core and power rings under the TSV landing pads. For the DEMUX on Tier 2, because metal2 has been used for TSV backside landing, active devices are placed away from the TSV connection region. This blocking distance is set to 10 μm in order to avoid potential interference or misalignment problems during TSV post-processing.
Considering the stress effects of TSVs, it has been demonstrated that longitudinal and transverse stresses influence the carrier mobilities in the vicinity of the TSV [19]. Based on the thermal stress contour maps in [19], the keep-out zone is estimated as 6.7 μm for the TSV we fabricated. It should be noted that proper design of high-speed serial links will have to take this variation into account in order to reach the required delay values.
To implement the scheme mentioned above, we exploit the capabilities of commercial EDA tools. First, we generate a LEF file that includes the modeled TSV and landing pad, enhanced with an input/output pin, which we then specify as an input/output of the circuit. For each signal to be connected to a TSV, we additionally connect it to a redundant TSV. Then we place power and ground rings in order to provide a fairer routing case study. Under normal conditions these power and ground rings can be shared among at least some parts of the design. Finally, we proceed with placement and routing of the design with the TSV pins specified as inputs or outputs of the circuit.
3) Results
The final layouts for the 8:1 MUX and 1:8 DEMUX are shown in Fig. 4. The area of each component and the total silicon area of the link are summarized in Table I. The area of the TSV array includes the 20 μm spacing between TSVs and the 10 μm blocking distance from a TSV to active devices. Thus, the area of a single TSV is taken as 40 μm by 40 μm (1600 μm2).
Since the entire system has been designed using a commercial standard cell library, we experience some performance loss: the 8:1 MUX reaches a 1.7 GHz limit, which restricts the NoC clock to 212.5 MHz. A careful full-custom design of the flops and combinational gates forming the system would allow further improvement.
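The per-TSV footprint used in the area tables can be reproduced with a short calculation. The sketch assumes one plausible reading of the accounting above, namely that the footprint is a square whose side equals the TSV diameter plus the TSV-to-TSV spacing (equivalently, the diameter plus a 10 μm blocking margin on each side). The names are hypothetical.

```python
TSV_DIAMETER_UM = 20.0  # from the text
TSV_SPACING_UM = 20.0   # from the text


def tsv_footprint_um2(diameter_um=TSV_DIAMETER_UM, spacing_um=TSV_SPACING_UM):
    """Square footprint reserved per TSV (assumed side = diameter + spacing)."""
    side_um = diameter_um + spacing_um
    return side_um * side_um


def tsv_array_area_um2(n_tsvs, diameter_um=TSV_DIAMETER_UM, spacing_um=TSV_SPACING_UM):
    """Array area obtained by adding up the individual footprints."""
    return n_tsvs * tsv_footprint_um2(diameter_um, spacing_um)


print(tsv_footprint_um2())    # footprint of one TSV in um^2
print(tsv_array_area_um2(4))  # e.g. an array of four TSVs
```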
B. Case study II: 8-Gb/s quasi-serial link for NoCs
The complete system architecture for the 8-Gb/s quasi-serial link is depicted in Fig. 5. The main architecture parameters are set according to the Scalable Communication Core (SCC) architecture from Intel [5]. SCC implements a 32-bit physical transfer digit and its NoC operates at 250 MHz. These parameters are applied to our link design. Four of the 2-Gb/s serial links designed in case study I are placed in parallel and act as an 8-Gb/s quasi-serial link, which should satisfy the needs of NoCs such as SCC. To accommodate this architecture, the clock TSVs are shared among the four 2-Gb/s serial links, and we increase the redundancy of the clock TSV to four. The other design rules are the same as in case study I. The area summary of the link is shown in Table II.
Our experiments show that our TSV process is not suitable for a parallel link architecture because a 32-bit bus with double redundancy would occupy approximately 0.1 mm2. That would correspond to 40K gates in the 90 nm CMOS standard cell library by Faraday used in this paper [9]. This gate complexity corresponds roughly to the area of a modern embedded processor optimized for area. For instance, the ARM9 architecture synthesized in a 90 nm TSMC library [10] occupies 0.2 mm2, which is only two times larger than an array of sixty-four TSVs. The situation is exacerbated by the presence of multiple parallel buses in a single chip. On the other hand, a quasi-serial link would reduce the area penalty to approximately 14K gates, which would make the design economically viable. Nevertheless, we are aware that a shift towards more advanced processes (density of 1 Mgate/mm2 or more) would require more aggressive serialization, such as a 16:1 MUX and a 1:16 DEMUX.
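The throughput match between the parallel NoC link and the quasi-serial link can be verified with a short calculation. The TSV count below additionally assumes that a single clock signal is shared by all lanes and that each data lane keeps the double redundancy of case study I; this is our reading of the description, and the function names are hypothetical.

```python
def link_throughput_gbps(width_bits, clock_mhz):
    """Raw throughput of a link transferring width_bits per clock cycle."""
    return width_bits * clock_mhz / 1000.0


def quasi_serial_tsv_count(n_lanes, data_redundancy=2, clock_redundancy=4):
    """TSVs for n_lanes serial lanes plus one shared clock (assumed single clock signal)."""
    return n_lanes * data_redundancy + clock_redundancy


parallel = link_throughput_gbps(32, 250)        # 32-bit link at 250 MHz
quasi_serial = 4 * link_throughput_gbps(1, 2000)  # four 1-bit lanes at 2 Gb/s each
print(parallel, quasi_serial)
print(quasi_serial_tsv_count(4))
```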
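The gate-equivalent figures can be cross-checked as follows. This is a rough sketch: the gate density is not quoted directly in the text but is inferred here from the statement that 0.1 mm2 corresponds to about 40K gates, and the per-TSV footprint of 1600 μm2 is taken from the Results of case study I. The names are hypothetical.

```python
TSV_FOOTPRINT_UM2 = 1600.0  # 40 um x 40 um per TSV (case study I)
GATES_PER_MM2 = 400_000.0   # inferred from "0.1 mm2 is about 40K gates"


def parallel_bus_area_mm2(width_bits, redundancy=2):
    """TSV array area of a parallel bus with the given per-signal redundancy."""
    return width_bits * redundancy * TSV_FOOTPRINT_UM2 * 1e-6


def gate_equivalents(area_mm2):
    """Number of standard-cell gates that would fit in the same silicon area."""
    return area_mm2 * GATES_PER_MM2


bus_area = parallel_bus_area_mm2(32)    # sixty-four TSVs
print(round(bus_area, 4))               # close to 0.1 mm2
print(round(gate_equivalents(bus_area)))  # close to 40K gates
```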
III. TSV MANUFACTURABILITY AND MODELS
The goal of this section is to study and to demonstrate the feasibility of TSVs working in the aforementioned serialized links.
A. Fabricated TSVs
The demonstrated TSV process is developed in Center of MicroNanofabrication, Ecole Polytechnique Federale de Lausanne (EPFL) in collaboration with IMEC® Belgium. A wafer thickness (TSV length) of 50 μm has been targeted with complete filling of copper to realize a void-free copper TSV for implementation in 3D integrated systems. A detailed description of the process steps is out of the scope of this paper. Even though the process has not reached a mature stage, we can envision that fabricating Cu TSVs with 20-μm diameter, 20-μm spacing and 50-μm height is a practical and realistic target. SEM photos of the fabricated TSVs are shown in Fig. 6 .
B. TSV Models
Preliminary work on TSV electrical modelling has been reported in [11] [12] [13]. We observe, though, a fair similarity among them in the parameters of a single TSV. The models in [11, 12] both consider the coupling in TSV bundles. However, in [12], the model for inductance calculation is invalid in the 200-800 MHz range because of a limitation of the simulation tool used. Therefore, the model proposed in [11] is the one we use to calculate the RLC impedance of the TSV.
Looking more closely at the model proposed in [11], we find that the simulated range of the dielectric isolation layer thickness is 0-1 μm, while in [12] the thickness of the dielectric is set to 1 μm. However, the dielectric thickness used in the TSV presented here is 5 μm, and the polymer we use for dielectric isolation has a lower permittivity than silicon dioxide. Regarding the first issue, thicker isolation should provide smaller capacitances, so as long as the simulation result follows this rule, it can be accepted as reasonable. To address the second issue, the capacitances calculated from the model proposed in [11] are multiplied by a correction factor ε_polymer/ε_SiO2. As a result, the DC resistance of the TSV Rv is calculated as 2.7 mΩ, the self capacitance Cs as 23.8 fF, and the self inductance Ls as 17.1 pH. Between two adjacent TSVs, the coupling capacitance Cc is calculated as 4.0 fF, and the mutual inductance Lm as 4.5 pH. Parameters of various TSV structures are listed in Table III. For frequencies above 1 GHz, signal-ground (S-G) or ground-signal-ground (G-S-G) structures are required [13, 14]. A TSV equivalent circuit model has been proposed in which the parasitics were extracted by fitting measurement-based S parameters up to 20 GHz [13, 14]. Thus, the electrical models of TSVs at multi-gigahertz frequencies are limited to specific TSV structures.
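The permittivity correction amounts to a simple scaling of the model capacitance. The sketch below only illustrates the arithmetic: the relative permittivity of the polymer is not stated in the text, so the value used here is a hypothetical placeholder, and 3.9 is the commonly quoted relative permittivity of silicon dioxide.

```python
EPS_R_SIO2 = 3.9     # commonly quoted relative permittivity of SiO2
EPS_R_POLYMER = 2.6  # hypothetical placeholder; not given in the text


def correct_capacitance(c_model_ff, eps_polymer=EPS_R_POLYMER, eps_sio2=EPS_R_SIO2):
    """Scale a capacitance computed for an SiO2 liner to a polymer liner."""
    return c_model_ff * eps_polymer / eps_sio2


# Example with an arbitrary model capacitance of 39 fF (illustrative input only).
print(round(correct_capacitance(39.0), 2))
```

Because the polymer permittivity is lower than that of SiO2, the factor is below one and the corrected capacitance is always smaller than the model value.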
C. TSV timing analysis
In our 3D physical links described in Section II, a 1 GHz clock and a 2-Gb/s data rate are assumed. We performed a Spectre® simulation of the TSV model previously described to ensure signal integrity. For the simulation we assume that one link starts with a minimum-sized inverter INV1 as a driver, followed by a 50 μm metal2 wire, a TSV fabricated in our process, another 50 μm metal2 wire, and ends with another INV1. The timing model is shown in Fig. 7. All the parasitics of the metal2 wiring, i.e. Rw, Cs_w, Cc_w, are taken from the 90 nm technology specification. Wires are assumed to be of minimum width. The coupling between two wires Cc_w is neglected, since the two wires can be far away from each other when routed to the TSVs. To obtain the minimum delay from the 50 μm metal wires, a 20X inverter INV20 replaces the first minimum-sized inverter. The simulation result is shown in Fig. 8. The delays with and without TSVs are summarized in Table IV. The addition of a TSV leads to both a degradation of rise and fall times and a larger crosstalk between the signals. Using the stronger driver INV20 instead of INV1 restores the timing response, although it brings larger crosstalk during switching.
The study reported in [20] shows that the insertion losses of TSVs with a height of 50 μm and a diameter of 40 μm can be minimized by using a grounded CPW to CPW 3D transition up to 170 GHz. The characteristics of the developed structures show that the insertion loss of these interconnects is mainly due to the high dielectric loss of the T-lines rather than the TSVs [20]. In our case, it is reasonable to expect similar performance when the TSV diameter is reduced to 20 μm. A 170 GHz bandwidth can support a digital data rate of around 56.7 Gb/s when harmonics up to the sixth are considered. This bandwidth is sufficiently high to support serialized links with transmission rates of up to 40 Gb/s, which is required for our discussion in Section IV.
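A first-order feel for why a stronger driver restores the timing can be obtained from an Elmore delay estimate of the driver-wire-TSV-wire path. This is only a back-of-the-envelope sketch and not a substitute for the Spectre simulation: it ignores inductance and coupling, and every value except the TSV resistance and self capacitance is a hypothetical placeholder chosen for illustration.

```python
def elmore_delay_s(stages):
    """Elmore delay of an RC ladder given as (series R in ohm, shunt C in farad)."""
    delay, r_upstream = 0.0, 0.0
    for r, c in stages:
        r_upstream += r            # total resistance between the source and this node
        delay += r_upstream * c    # each capacitor is charged through that resistance
    return delay


R_TSV, C_TSV = 2.7e-3, 23.8e-15   # Rv and Cs from the TSV model in the text
R_WIRE, C_WIRE = 50.0, 10e-15     # hypothetical 50 um metal2 segment
C_LOAD = 2e-15                    # hypothetical input capacitance of the receiver
R_INV1, R_INV20 = 10e3, 500.0     # hypothetical driver output resistances


def path(r_driver):
    """Driver, wire, TSV, wire and receiver lumped into a four-node ladder."""
    return [(r_driver, 0.0), (R_WIRE, C_WIRE), (R_TSV, C_TSV), (R_WIRE, C_WIRE + C_LOAD)]


print(elmore_delay_s(path(R_INV1)) * 1e12, "ps with the weak driver")
print(elmore_delay_s(path(R_INV20)) * 1e12, "ps with the strong driver")
```

Under these assumed values the delay is dominated by the driver resistance charging the wire and TSV capacitances, while the milliohm-range TSV resistance contributes almost nothing.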
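The 56.7-Gb/s figure follows from a simple relation between channel bandwidth and data rate. The sketch assumes NRZ signalling, for which the fundamental frequency of an alternating bit pattern is half the data rate, and it assumes that the channel must pass harmonics up to the stated order; the function name is hypothetical.

```python
def max_data_rate_gbps(bandwidth_ghz, highest_harmonic=6):
    """Data rate whose highest_harmonic-th harmonic just fits in the bandwidth (NRZ assumed)."""
    fundamental_ghz = bandwidth_ghz / highest_harmonic
    return 2.0 * fundamental_ghz  # NRZ: data rate is twice the fundamental


print(round(max_data_rate_gbps(170.0), 1))
```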
IV. EXTENSION STUDY
In the previous sections, we have demonstrated that serial and quasi-serial links can bring great benefits in area and design complexity. We now discuss the feasibility of applying this idea to links that would require higher working frequencies if serialized. After reviewing the serial link designs in existing works [16, 17], some conclusions can be reached. For example, for serial links working at frequencies higher than 3 GHz, current-mode logic (CML) is a better solution to meet timing, while for frequencies higher than 10 GHz, inductors need to be added to the design. Following these rules, we build the theoretical area study in this section as follows. The 90 nm technology standard cell areas are used for circuits designed in conventional CMOS technology working at a maximum of 3 Gb/s. The standard cell areas of a CML library designed in our laboratory are used for circuits designed for a maximum of 10 Gb/s. When the inductive peaking technique is required, the MUX2:1 and DEMUX1:2 designed in [16] are borrowed because they have a more uniform structure and ensure a more reliable system when cascaded with the rest of the MUX/DEMUX link. The inductor size is also taken from the 90 nm standard cell library. Even though in [17] two inductors are implemented in an interleaved fashion, they take the area of a single inductor, which is already very area-consuming. In our area model, we assume that the two inductors in one CML standard cell occupy the silicon area of a single inductor. The area of the TSV arrays is calculated by simply adding all the single TSV areas together. The specific arrangement of TSV locations to achieve better signal integrity is not considered. Double redundancy is used for signal TSVs; four-fold redundancy is used for clock TSVs. For the clock divider, we use the one presented in [18], which is designed without inductors and achieves a very small size of 9.2 μm by 5.2 μm.
The total area of the 3D inter-tier link is the sum of the areas of the MUX, the DEMUX, the clock divider and the TSV array. Since CML circuits and inductors are involved in the design, we simply assume that active devices are blocked from the TSV array. Thus, the MUX's area is counted in the total area. For circuits designed in CMOS logic, the calculated CMOS logic area is multiplied by a 1/0.7 correction factor to account for the area needed for routing the signals and the power/clock rails; the CML area is multiplied by a 1/0.6 correction factor for the same reason. We admit that routing signals from the active area to the TSV array also consumes some area, but this is case-specific and thus is not taken into account in this theoretical study. As can be seen in Fig. 9, a maximum area saving of 75% can be achieved through serialization. When the link width increases to more than 64 bits, the serial link's area can even reach double the size of its parallel counterpart because inductors are introduced. By using the quasi-serial strategy, wider links can be implemented with much smaller area and operate at a lower maximum clock frequency.
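The area model described above can be summarized in a few lines. This is a sketch of the bookkeeping only, under stated assumptions: the per-TSV footprint of 1600 μm2 is carried over from case study I, a single clock signal is assumed, and the cell areas passed in are whatever the designer reads from the respective libraries. The function and parameter names are hypothetical.

```python
TSV_FOOTPRINT_UM2 = 1600.0  # per-TSV footprint, assumed unchanged from case study I
CMOS_UTILIZATION = 0.7      # CMOS cell area is divided by this factor
CML_UTILIZATION = 0.6       # CML cell area is divided by this factor


def link_area_um2(cmos_cell_area_um2, cml_cell_area_um2, n_signals, n_clocks=1):
    """Total link area: corrected circuit area plus the TSV array.

    Signal TSVs use double redundancy and clock TSVs four-fold redundancy.
    """
    n_tsvs = 2 * n_signals + 4 * n_clocks
    circuit_area = (cmos_cell_area_um2 / CMOS_UTILIZATION
                    + cml_cell_area_um2 / CML_UTILIZATION)
    return circuit_area + n_tsvs * TSV_FOOTPRINT_UM2


# Illustrative inputs only: 700 um^2 of CMOS cells, 600 um^2 of CML cells, one data signal.
print(round(link_area_um2(700.0, 600.0, n_signals=1)))
```

Separating the circuit term from the TSV term makes the trade-off explicit: serialization shrinks the TSV term while the circuit term grows as faster logic styles and inductors become necessary.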
V. CONCLUSION
In this paper, we exploit the large bandwidth offered by state-of-the-art TSV technology and apply it to inter-tier link design. The proposed inter-tier quasi-serial link requires approximately five times less area than the traditional synchronous parallel link. This approach can be considered a low-cost and efficient inter-tier communication solution for 3D NoC designs.
