Abstract-A 3D stacked IC is made of multiple dies, possibly with heterogeneous process technologies. Therefore, die-to-die variation in 2D chips becomes on-package variation (OPV) in a 3D chip. Although the variation effect differs in 3D chips, 3D die stacking can generally produce high yield due to the smaller individual die area and the averaging effect of variation on data paths. However, a 3D clock network can experience unintended large clock skew due to the different clock propagation routes on multiple stacked dies. In this paper, we analyze the on-package variation effect on 3D clock networks and show the necessity of a post silicon management method, such as the body biasing technique, for controlling OPV induced 3D clock skew in 3D stacked IC designs. Then, we present a parametric yield improvement method to mitigate the OPV induced 3D clock skew.
I. INTRODUCTION
As CMOS process technology scales down, manufacturing variation becomes severe and on-chip variation strongly affects chip performance. In the meantime, the through-silicon via (TSV) based 3D IC design technique is actively researched to alleviate long interconnection delay and to improve chip yield by splitting a large design into several dies. As a result, die-to-die variation in 2D chips becomes on-package variation (OPV) in a 3D chip. Although the variation effect differs in 3D chips, 3D die stacking can generally produce high yield due to the smaller individual die area and the averaging effect of variation on data paths [1] [2] [3]. However, a 3D clock network can experience unintended large clock skew due to the different clock propagation routes on multiple stacked dies. Consequently, for accurate timing closure of 3D IC designs, it is necessary to consider the on-package variation effect on 3D clock networks.
In recent years, a number of studies have examined 3D clock network designs. Temperature dependent clock skew control and power analysis for H-tree based clock network topologies are presented in [4] [5] [6]. Mondal et al. [4] proposed a thermally adaptive clocking scheme to reduce the temperature dependent clock skew. Arunachalam and Burleson [5] proposed a low power clock design that uses a separate layer for the clock network. Pavlidis, Savidis, and Friedman [6] compared clock skew and power consumption for various H-tree based clock network topologies using real measurement data.
For buffered clock tree designs, 3D clock tree synthesis (CTS) algorithms are studied in [7] [8] [9] [10]. Minz, Zhao, and Lim [7] minimize and balance temperature dependent skew by relocating the merging points of an initial zero skew clock tree. They also provided a detailed analysis showing the low power clocking property of the multi-TSV approach compared to the single-TSV approach in [8]. Kim and Kim proposed a low cost and low power 3D CTS solution that guarantees a minimal use of TSVs in [9] and a non-zero skew bounded 3D clock tree routing algorithm in [10]. A 3D CTS methodology with pre-bond testability is enabled by [11] and optimized by [12]. For process variation, Xu et al. [13] analyzed the process variation induced clock skew for scaled 2D and 3D ICs, and Yang et al. [14] proposed a process variation aware 3D CTS methodology. However, the work in [13] is limited to the simple H-tree, which is only practical for regularly placed sink elements. The work in [14] did not optimize the die-to-die variation in 3D stacked ICs. Moreover, the variation reduction method for clock buffers in [14] can inherently sacrifice too many TSVs.
Manuscript received Aug. 22, 2011; revised Dec. 1, 2011. School of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea. E-mail: {takyung, tkim}@ssl.snu.ac.kr
For the yield improvement of 3D ICs, a number of studies proposed wafer or die matching methodologies to maximize functional yield, parametric yield, and the corresponding profits [16] [17] [18]. Smith et al. [16] showed the yield trend of 3D ICs compared to 2D ICs and proposed improving the functional yield of wafer-to-wafer bonded 3D ICs by choosing wafer pairs whose patterns of working dies best match each other. Reda, Smith, and Smith [17] proposed wafer assignment algorithms to maximize the functional yield of 3D stacked ICs in wafer-to-wafer integration. In [18], Ferri, Reda, and Bahar presented die matching algorithms for die-to-wafer and die-to-die 3D integration that maximize the parametric yield and profit with regard to performance and leakage constraints for a 2-die stacked, speed-binned 3D processor made of CPU and L2 cache dies.
Even though yield improvement works already exist for 3D stacked ICs, they only consider functional yield of individual dies or parametric yield of critical path models. The work in [17] did not consider parametric yield loss. The work in [18] needs a critical path model representing the whole system behavior. Importantly, the aforementioned 3D yield improvement works [16] [17] [18] did not consider different timing behavior of the 3D clock network compared to the 2D one. Moreover, 3D yield analysis works in [1] [2] [3] using the averaging effect of variation on data path can be more meaningful when we guarantee correct clock signal arrival under a tolerable 3D clock skew budget.
In this paper, we propose a post silicon management technique for on-package variation induced 3D clock skew in die-to-wafer and die-to-die 3D IC integration styles. To the best of our knowledge, this is the first work mitigating the on-package variation effect, i.e., the die-to-die variation effect, on 3D clock timing to guarantee correct operation across the whole design. Precisely, the contributions of the work are as follows: (1) we introduce the concept of on-package variation (OPV) in 3D ICs and its effect on the system timing behavior; (2) we analyze the on-package variation effect on 3D buffered clock trees and show the necessity of a post silicon skew control method, such as the body biasing technique, to manage the on-package variation effect in 3D IC designs; (3) we present a die matching strategy that, combined with body biasing, maximizes the chance of recovering 3D clock skew that has deviated from the expected budget.
This work extends the preliminary work in [15] to include a complete management scheme for the on-package variation induced 3D clock skew. The remainder of this paper is organized as follows. In section II, the concept of the 3D clock tree and the related on-package variation effect on design timing are presented. Then, we overview the body biasing technique used for device tuning in section III and present a parametric yield improvement method to maximize the recovery chance of the deviated 3D clock skew in section IV. Experimental results are provided in section V to show the non-negligible on-package variation effect on the 3D clock skew and the effectiveness of the body biasing technique in reducing the on-package variation. Finally, a conclusion of the work is given in section VI.
II. 3D CLOCK TREE AND ON-PACKAGE VARIATION
Since the clock network synchronizes all circuit elements, it must be optimized for design timing and power with low resource usage. In other words, a 3D clock tree should be optimized in terms of the number of TSVs, wire, and buffer resources. Such high quality clock trees, which fully utilize the 3D design space, can span entire dies as shown in Fig. 1 [7] [8] [9] [10] [11] [12] [13] [14]. To safely apply highly optimized 3D clock trees distributed over multiple dies to the main production process of 3D ICs, we need to consider a new 3D process variation issue that is distinct from the conventional process variation for 2D clock trees.
A 3D stacked IC is made by stacking independently manufactured dies. This means that each die assembled in a 3D package can be manufactured at a different process corner from the other dies. In other words, die-to-die process variation in 2D ICs acts as the on-package variation in a 3D IC. Fig. 2(a) shows an example of a 2-die stacked 3D IC. Conventionally, during timing closure, we assume that all the circuit elements are manufactured at the same process corner as shown in Fig. 2(b). However, there are additional process corner combinations to consider because of the on-package variation derived from the die-to-die variation, as shown in Fig. 2(c).
If we do not consider the on-package variation, we cannot guarantee the correct operation of a final stacked 3D IC due to the possibility of speed degradation and (especially) system function failure. Fig. 3 shows examples of the on-package variation effect on design timing. Logic gates on the launch clock and data paths (CP2, FF2, and DP) are manufactured on die2, and gates on the capture clock path (CP1 and FF1) on die1. If die2 is a slow corner sample and die1 is a fast corner sample as shown in Fig. 3(a), the conventional all slow (or all fast) corner-based timing sign-off shown in Fig. 2(b) does not work. Die-to-die variation between 2D dies becomes on-package variation in a 3D system, and chip performance can be degraded due to the unintended fast clock propagation on the capture clock path (CP1). Similarly, if die2 is a fast corner sample and die1 is a slow corner sample as shown in Fig. 3(b), system function failure due to an unintended hold time violation can occur for fast data paths with few logic gates.
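The two failure modes described for Fig. 3 can be illustrated with the standard setup and hold slack relations. The following is a minimal sketch, assuming hypothetical path delays, a hypothetical 1000 ps clock period, and simple per-die delay scale factors (1.2 for a slow die, 0.8 for a fast die); none of these numbers come from the paper, and the class and function names are introduced only for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-die delay scale factors (not from the paper).
SLOW, FAST = 1.2, 0.8


@dataclass
class Path:
    launch_clk: float   # CP2 delay in ps, placed on die2
    data: float         # FF2 clock-to-Q plus DP delay in ps, placed on die2
    capture_clk: float  # CP1 delay in ps, placed on die1


def slacks(p, period, k_die1, k_die2, setup=30.0, hold=20.0):
    """Return (setup_slack, hold_slack) in ps for given per-die scale factors."""
    launch = p.launch_clk * k_die2
    data = p.data * k_die2
    capture = p.capture_clk * k_die1
    # Setup: data launched at one edge must arrive before the next capture edge.
    setup_slack = (period + capture) - (launch + data) - setup
    # Hold: data must not race through before the same capture edge.
    hold_slack = (launch + data) - capture - hold
    return setup_slack, hold_slack


long_path = Path(launch_clk=300.0, data=800.0, capture_clk=300.0)
short_path = Path(launch_clk=300.0, data=50.0, capture_clk=300.0)

# Conventional sign-off: both dies at the same corner.
setup_all_slow, _ = slacks(long_path, 1000.0, k_die1=SLOW, k_die2=SLOW)
_, hold_all_fast = slacks(short_path, 1000.0, k_die1=FAST, k_die2=FAST)

# On-package variation: dies at different corners.
setup_mixed, _ = slacks(long_path, 1000.0, k_die1=FAST, k_die2=SLOW)   # Fig. 3(a) case
_, hold_mixed = slacks(short_path, 1000.0, k_die1=SLOW, k_die2=FAST)   # Fig. 3(b) case
```

With these assumed numbers, both same-corner checks pass while the mixed-corner setup and hold slacks become negative, which mirrors the argument that same-corner sign-off alone does not cover a stacked design.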
As shown in Fig. 3 , the conventional all slow (or all fast) corner-based timing sign-off cannot guarantee the correct operation of 3D stacked ICs due to the unintended delay mismatches caused by the on-package variation. Therefore, the on-package variation effect should be taken into account for 3D IC designs, more importantly for the commercial ASIC designs that demand high yield even at corner cases.
III. POST SILICON 3D CLOCK SKEW MANAGEMENT WITH BODY BIASING TECHNIQUE
As shown in section II, independently manufactured dies of a 3D stacked IC can exhibit worse timing behavior, and we must consider the on-package process variation effect. However, if we consider the new process corners of 3D stacked ICs at design time, the design turnaround time (TAT) will increase greatly because the number of timing sign-off corners grows exponentially with the number of stacked dies. For example, for the 2-die stacked 3D ICs in Fig. 2, the number of timing sign-off corners increases from 2^1 = 2 corners to 2^2 = 4 corners. So, to avoid additional timing sign-off corners in 3D IC designs, it is necessary to control the device characteristics of the manufactured dies. Body biasing [19, 20] is one of the most effective techniques for controlling device characteristics at the post silicon stage. As illustrated in Fig. 4(a), the body biasing technique controls device characteristics by changing the body voltage of NMOS and PMOS transistors. If the NMOS (PMOS) body bias voltage is higher (lower) than the source voltage as shown in Fig. 4(b), the transistor is forward biased, with increased speed and leakage. If the NMOS (PMOS) body bias voltage is lower (higher) than the source voltage as shown in Fig. 4(c), the transistor is reverse biased, with reduced speed and leakage. To mitigate the on-package variation, we apply global body biasing voltages to individual dies. (If we need to consider the on-die variation, we must apply multiple body biasing voltages across the 2D planes. In that case, a block-level body biasing method similar to that in [19] can be used with sensor and body bias voltage control circuitry similar to those in [4, 21, 22].)
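The exponential growth in sign-off corners can be made concrete by enumerating every per-die corner assignment. This is a small sketch assuming two corners per die (slow and fast), as in the 2^1 to 2^2 example above; the function name is hypothetical.

```python
from itertools import product


def signoff_corners(num_dies, corners=("slow", "fast")):
    """Enumerate every assignment of a process corner to each stacked die."""
    return list(product(corners, repeat=num_dies))


# Two stacked dies give 2^2 = 4 combinations, including the mixed ones.
two_die = signoff_corners(2)
# Four stacked dies already give 2^4 = 16 combinations.
four_die = signoff_corners(4)
```

Only the combinations in which every die shares the same corner are covered by conventional sign-off; the remaining ones are the on-package variation cases.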
By applying body biasing to manufactured dies as a post silicon tuning method, we can mitigate the on-package variation induced 3D clock skew. However, the allowable amount of body biasing voltage is known to be limited by the control capability of the voltage regulator and by device reliability characteristics such as negative bias temperature instability (NBTI), hot carrier injection (HCI), time dependent dielectric breakdown (TDDB), breakdown voltage (BV), latch-up, and leakage current [20]. Fig. 5 shows buffer delay trends at various process corners with varying body bias voltages under a 1.2 V power supply voltage in 45 nm process technology. If we are allowed to use body bias voltages of up to 0.5 V as shown in Fig. 5(a), we can tune the device characteristics into the green rectangular region by applying a proper body bias voltage. However, if we are only allowed to use body bias voltages of up to 0.2 V as shown in Fig. 5(b), we cannot tune all the devices to have similar delay. For example, devices at points A and C can be tuned to points B and D by applying 0.2 V FBB and RBB voltages, respectively. However, devices in regions E and F cannot be tuned to have similar delay to each other because the allowable body bias voltage is limited. So, if we want to use body biasing as a post silicon 3D clock skew management technique, we need a smart method to maximize the parametric yield with regard to 3D clock skew under a limited body biasing voltage range.
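The effect of a limited body bias range can be sketched with a toy delay model. The linear model below, its nominal delay (100 ps), corner sensitivity (20%), and bias sensitivity (50% per volt) are all assumptions chosen for illustration; they are not fitted to the curves in Fig. 5. Two corners are treated as tunable to a similar delay when their reachable delay intervals overlap.

```python
def delay(corner, vbb, d_nom=100.0, corner_sens=0.2, bias_sens=0.5):
    """Toy buffer delay in ps (hypothetical linear model).

    corner: -1.0 (fast) .. +1.0 (slow); vbb > 0 is forward bias (faster),
    vbb < 0 is reverse bias (slower).
    """
    return d_nom * (1.0 + corner_sens * corner) * (1.0 - bias_sens * vbb)


def delay_range(corner, vbb_max):
    """Delay interval reachable at one corner with |vbb| <= vbb_max."""
    return delay(corner, +vbb_max), delay(corner, -vbb_max)


def common_window(c1, c2, vbb_max):
    """Return the shared delay interval of two corners, or None if disjoint."""
    lo1, hi1 = delay_range(c1, vbb_max)
    lo2, hi2 = delay_range(c2, vbb_max)
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    return (lo, hi) if lo <= hi else None


# Slow die against fast die under a wide and a narrow bias range.
wide = common_window(1.0, -1.0, vbb_max=0.5)    # a shared window exists
narrow = common_window(1.0, -1.0, vbb_max=0.2)  # no shared window
```

Under these assumed parameters a slow and a fast die share a delay window with the 0.5 V range but not with the 0.2 V range, which is the qualitative situation described for regions E and F.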
IV. PARAMETRIC YIELD IMPROVEMENT OF 3D CLOCK SKEW
As shown in section III, body biasing is an effective technique to control device characteristics in post silicon stage. However, if an allowable body biasing voltage range is narrow, its effect can be very limited. To maximize the recovery chance, i.e. parametric yield, of the deviated 3D clock skew under a limited body biasing voltage range, in this section we present a die matching strategy of 3D stacking dies. First, we propose a matching strategy for 2-die stacked 3D ICs, and then discuss the general case (the number of stacking dies > 2). We assume die-to-wafer and die-to-die 3D IC integration styles because the die matching in wafer-to-wafer bonding is very limited. In addition, to only assess the parametric yield of 3D clock skew, we assume that all the stacking dies are functional and the process corner for each die is known with wafer level die testing.
For two sets of N dies with identified process corners, we want to find a die matching, together with body biasing voltages, that maximizes the parametric yield of 3D clock skew under a given body bias voltage range (V_BB.MIN ~ V_BB.MAX). In this paper, if the post silicon 3D clock skew is within the expected clock skew (estimated when all dies are manufactured at the same slow corner), we consider the chip good, and otherwise bad. First, we identify the process corners of the two sets of N dies by using conventional wafer level testing, which measures threshold voltage (Vth), drain saturation current (Ids), and other characteristics of transistors on the scribe lane TEG (Test Element Group), or by using specially designed on-chip monitoring circuits [21, 22]. Without loss of generality, we assume that we can identify the process corner of each die in 10% steps between the slow/fast corner and the nominal corner as illustrated in Fig. 6(a). Then, we make a die matching comprising N 2-die stacked 3D ICs that maximizes the parametric yield of 3D clock skew.
After identifying the manufactured process corner of each die, we need to know which matching produces as many good 3D ICs as possible. A naive approach is to simulate the whole 3D clock tree with various body bias voltages for each die. However, simulating the whole 3D clock tree can be time consuming. Instead, we predefine the body biasing voltages of the two stacking dies for each process corner combination by preprocessing the clock buffer delay trend into a variable BUFBB(c1, c2), given by Eq. (1), where c1 and c2 are the process corners of the two dies, and delay(c) represents the delay range achievable under a given body bias voltage range at a process corner c. Fig. 7(a) shows an example of BUFBB calculation results under a 0.2 V allowable body bias voltage range. During the BUFBB calculation, the body biasing voltages for the two dies (VBB1(c1,c2) and VBB2(c1,c2)) are also selected so that the buffer delay difference between the two corners is minimized. An example is shown in Fig. 7(b). Now, we know the post silicon tuning possibility for each process corner combination of two stacking dies and the corresponding body biasing voltages. Therefore, we can match two sets of N dies to maximize the parametric yield of 3D clock skew. We find an optimal matching by applying the classical Hungarian algorithm [23, 24], which optimally computes the maximum weight matching (assignment) in a bipartite graph, as is done in [17, 18]. To apply the Hungarian algorithm, we construct a bipartite graph with 2N vertices representing the two sets of N dies with identified process corners and N^2 edges whose costs are taken from BUFBB(c1,c2) in Eq. (1). Then, our die matching problem for 2-die stacked 3D ICs can be solved in O(N^3) time. For the general case of stacking more than two dies, an ILP formulation can be used as is done in [18].
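The matching step can be sketched as follows. Two things here are assumptions rather than the paper's method: BUFBB is modeled as a 0/1 indicator that the tunable delay ranges of two corners overlap, and the solver is a simple augmenting-path maximum bipartite matching (Kuhn's algorithm) swapped in for the Hungarian algorithm. With 0/1 edge weights, a maximum-cardinality matching yields the same number of good pairs as an optimal assignment. The corner labels and delay ranges are hypothetical.

```python
def bufbb(c1, c2, ranges):
    """Assumed form: 1 if the tunable delay ranges of c1 and c2 overlap, else 0."""
    lo1, hi1 = ranges[c1]
    lo2, hi2 = ranges[c2]
    return 1 if max(lo1, lo2) <= min(hi1, hi2) else 0


def match_dies(set1, set2, ranges):
    """Pair each die in set1 with one die in set2, maximizing tunable pairs.

    Uses recursive augmenting paths, so it suits small N; a large N such as
    1000 would call for an iterative or Hungarian implementation.
    """
    n = len(set1)
    # adj[i] lists the set2 dies that die i can be tuned to match.
    adj = [[j for j, c2 in enumerate(set2) if bufbb(c1, c2, ranges)]
           for c1 in set1]
    owner = [-1] * len(set2)  # owner[j] = index in set1 paired with set2[j]

    def try_assign(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if owner[j] == -1 or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    good = sum(try_assign(i, set()) for i in range(n))
    pairs = {owner[j]: j for j in range(len(set2)) if owner[j] != -1}
    # Dies that cannot be tuned are bonded to whatever is left over.
    leftover = [j for j in range(len(set2)) if owner[j] == -1]
    for i in range(n):
        if i not in pairs:
            pairs[i] = leftover.pop()
    return good, pairs


# Hypothetical reachable delay ranges in ps for slow, nominal, and fast dies.
RANGES = {"S": (108.0, 132.0), "N": (90.0, 110.0), "F": (72.0, 88.0)}
SET1 = ["S", "N", "F"]
SET2 = ["F", "S", "N"]

good, pairs = match_dies(SET1, SET2, RANGES)
# Bonding in arrival order, without matching, for comparison.
oblivious_good = sum(bufbb(a, b, RANGES) for a, b in zip(SET1, SET2))
```

In this toy instance the matched bonding makes every pair tunable, while bonding the dies in arrival order leaves most pairs outside the tunable range.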
V. EXPERIMENTAL RESULTS
To analyze the on-package variation effect, we constructed 3D buffered clock trees by applying the flow and algorithms in Fig. 8 [9] . An abstract clock tree topology is obtained by performing the 3D extension of the work [25] , and clock tree routing is done by performing the 3D extension of DME [26] with the minimum number of TSVs [9] .
Benchmark circuits from the ISPD clock network synthesis contest [27] are used under the Elmore delay model. In addition, the 45 nm Predictive Technology Model from NCSU FreePDK [28], based on ASU PTM [29], is used for SPICE simulation. For 3D CTS of the ISPD benchmark circuits, we transformed them into 2-layered 3D placements with the die footprint reduced by a factor of [...]. To make our CTS more realistic, we set the clock frequency to 1 GHz under a 1.2 V supply voltage while constraining the maximum slew to 100 ps (10% of the clock period) with a maximum loading capacitance of 300 fF. Table 1 summarizes the initial 2-layered 3D CTS results without consideration of on-package variation. The first column shows the benchmark circuits. The next four columns show the number of TSVs allocated, the total wirelength, the number of buffers inserted, and the clock propagation delay of the 3D clock trees, respectively. To illustrate the 3D clock tree structure, a 2-layered 3D CTS result for the ispd09fnb1 benchmark circuit is shown in Fig. 9. Fig. 9(a) and (b) show the clock trees on die1 and die2, respectively. The whole 3D clock tree is shown in Fig. 9(c), including TSVs (big red dots) and buffers (small blue dots). For the constructed 3D clock trees, we analyze the on-package variation effect on 3D clock skew. Fig. 10 and Table 2 show the on-package variation and body biasing effects on clock skew for the CTS results in Table 1. In Fig. 10, the skew values of every combination of process corners of the two dies are plotted. (The red dot indicates the averaged skew value of each combination.) If we consider the on-package variation, the average skew value of 28.18 ps under the conventional timing sign-off (i.e., slow corner for both dies) can become 92.91 ps for a mixed process corner combination (i.e., fast corner for one die and slow corner for the other die), which is almost 10% of the clock period.
The clock skew worsens because a 3D clock tree can span entire dies, so the resulting clock tree suffers from the on-package variation. In Table 2 [...]

ispd09f11  21  14   63   70   47   55  33  43  25  31  18  20
ispd09f12  33  23   86   76   70   60  56  46  43  34  30  26
ispd09f21  28  22  134  142  102  109  74  79  49  53  25  29
ispd09f22  26  22  121   89   95   63  72  40  50  18  30  19
ispd09f31  30  24  126  129   94   99  66  72  41  47  22  30
ispd09f32  39  29  117  104   94   80  73  63  55  47  38

Fig. 10. Plot of the on-package variation and body biasing effects on clock skew for 2-layered 3D buffered clock trees in Table 2.

We have shown that the on-package variation induced 3D clock skew can be mitigated by applying the body biasing technique. However, the allowable amount of body biasing voltage is limited by the control capability of the voltage regulator and the reliability characteristics of the device [20]. In this paper, without loss of generality, we assume that we are allowed to use body biasing in the range of 0.2 V to evaluate our die matching strategy under a limited body biasing voltage range. Based on the clock buffer delay trends, we perform die matching for two randomly generated sets of 1000 dies (N = 1000) with identified process corners, drawn from a Gaussian distribution, by applying the Hungarian algorithm as described in section IV. We consider a bonded 3D IC a good chip if the post silicon tuned 3D clock skew is under the initial clock skew obtained when all dies are manufactured at the slow corner. Table 3 shows the parametric yield results of our die matching strategy with body biasing compared to the yield oblivious die bonding method without body biasing. The first column shows the benchmark circuit names. The next two columns show the yield results under a 0 ps skew margin. By using our proposed die matching strategy, we can improve the parametric yield by 42.22 % on average compared to the yield oblivious method. Despite the large yield improvement with the body biasing method, a chip designer may want to use a simple design margin (e.g., an additional skew margin) instead of body biasing. The last two columns show the yield results when an additional 5 ps skew margin is available. The yield oblivious method gains only 16.42 % in parametric yield compared to the case without any skew margin. Moreover, if we utilize the additional 5 ps skew margin with our die matching strategy, we can provide a very high yield of 99.71 % on average, which is a 45.24 % increase from the baseline.
VI. CONCLUSIONS
In this work, we observed that the on-package variation caused by the stacking of dies in 3D IC integration often produces extremely high clock skew, and showed that applying simple coarse-grained body biasing to individual dies can considerably mitigate the increase of OPV induced 3D clock skew. Moreover, compared to the yield oblivious method, our proposed die matching solution can considerably increase the parametric yield of 3D clock skew by mitigating the on-package process variation effect.
