Pre-bond testing of 3D stacked ICs involves testing individual dies before bonding. The overall yield of 3D ICs improves with pre-bond testability because designers can avoid stacking defective dies with good ones. However, pre-bond testability presents unique challenges to 3D clock tree design. First, each die needs a complete 2D clock tree for pre-bond testing. In addition, the entire 3D stack needs a complete 3D clock tree for post-bond testing and normal operation. In the case of a two-die stack, a straightforward solution is to have two complete 2D clock trees connected by a single through-silicon via (TSV). We show that this solution suffers from long wirelength and high clock power consumption. Instead, our algorithm minimizes the overall wirelength and clock power consumption while providing pre-bond testability and post-bond operability under given skew and slew constraints. Compared with the single-TSV solution, SPICE simulation results show that our multi-TSV approach reduces the clock power by up to 15.9% for two-die and 29.7% for four-die stacks. In addition, the wirelength reduction is up to 24.4% and 42.0%, respectively.
INTRODUCTION
3D system integration has emerged as a key enabling technology to continue the scaling trajectory predicted by Moore's Law for future IC generations. Using 3D integration, the average and maximum communication distances between components placed on different dies can be substantially reduced, which translates into significant savings in delay, power, and area. Moreover, it enables the integration of heterogeneous devices, making the entire system compact and efficient. Nevertheless, the success of 3D stacked ICs is predicated on the final post-bond yield, i.e., on minimizing the chance of bonding good dies to defective ones. In other words, pre-bond testability must be provided so that each individual die, which may contain only partial functionality, can be tested prior to the bonding process.
* This material is based upon the work supported by the National Science Foundation under CAREER Grant No. CCF-0546382, Grant No. CCF-0811738, the Center for Circuit and System Solutions (C2S2), and the Interconnect Focus Center (IFC).
To tackle the testing issues for 3D stacked ICs, several testing methods have been investigated. In [13], Wu et al. proposed 3D scan chain design approaches that improve testability while minimizing the stitching wirelength. Lewis and Lee presented an architectural solution in [8] to the pre-bond testability of 3D die-stacked microprocessors. They discussed how to test functional modules that are split across multiple dies. They also investigated new design methods in [9] to address similar testing issues caused by partially functional pre-bond circuits.
Minz et al. [10] presented a 3D clock routing algorithm whose goal is wirelength minimization. Their 3D tree has a unique property: only one of the dies in the stack contains a fully connected 2D clock tree, while the other dies contain many small, isolated sub-trees. While this algorithm takes advantage of TSVs to shorten the total wirelength of the clock signal, it makes pre-bond testing very difficult, because a large number of test probes are required to provide synchronous clock signals through these TSVs when testing the dies without a fully connected 2D clock tree. Their technique shows that multiple TSVs help reduce wirelength and clock power but complicate pre-bond testing. Our work aims at addressing these issues and providing methods to design a pre-bond testable clock tree for 3D stacked ICs.
The contributions of our work are as follows: (1) We present the first work on pre-bond testable clock routing. We propose a new circuit element called the TSV-buffer, which supports zero-skew pre-bond testing for clock trees that use multiple TSVs. We also introduce the so-called redundant tree, which supports the pre-bond testing of dies that do not contain a fully connected clock tree. We show that these circuit elements are essential in supporting efficient pre-bond testing while minimizing the overall wirelength and clock power. (2) In order to improve the reliability of our pre-bond testable 3D clock tree, we develop a slew-aware merging and buffering method that keeps the slew rate at the clock sinks under the given constraint. This method also helps reduce the wirelength and power consumption of the pre-bond testable 3D clock tree. (3) Compared with a straightforward solution, which uses a single TSV between two dies for pre-bond testability, our solution reduces the wirelength and clock power consumption by up to 24.4% and 15.9% for two-die stacks, and by up to 42.0% and 29.7% for four-die stacks.
TESTABLE CLOCK ROUTING
The pre-bond testable 3D clock routing problem is defined as follows: Given a set of sinks distributed on N dies (N > 1) and an upper bound on the TSV count, the goal is to construct a 3D clock tree such that during post-bond operations, the tree connects all the sinks and provides the clock signal with minimum skew; and during pre-bond testing, a 2D clock tree, together with one test probe for each die, provides the clock signal to the sinks on the die with minimum skew. The objective is to minimize the wirelength and clock power under the given TSV budget and clock slew constraints.
Overview
We first develop a pre-bond testable clock routing algorithm for a two-die stack. We then extend it to handle more than two dies in Section 2.6. The input to our algorithm includes the location and capacitance of the sinks on both dies (= die-0 and die-1), and an upper bound on TSV usage (> 1). Assume that die-0 contains the clock source. Our algorithm consists of the following two main steps: (1) 3D tree construction: the goal is to generate a 3D clock tree connecting all the sinks on both dies so that (a) the overall 3D tree has zero skew under the Elmore delay model, (b) the total wirelength is minimized, and (c) die-0 contains a fully connected 2D tree with zero skew. In this case, the 3D tree is used during post-bond testing and normal operations, and the 2D tree on die-0 is used for pre-bond testing of die-0. We utilize the so-called "TSV-buffer" to make sure that the 2D tree on die-0 maintains zero skew both pre-bond and post-bond. (2) Redundant tree routing: if multiple TSVs are used, the 3D tree construction step generates a 3D tree in which die-1 contains several sub-trees that are not connected. The goal of the redundant tree routing step is to connect the roots of the sub-trees on die-1 and form a single fully connected 2D tree so that (a) the skew is zero, and (b) the total wirelength is minimized. This 2D tree is used for the pre-bond testing of die-1. The additional tree used to connect the roots, the so-called "redundant tree", is disconnected during post-bond operations. We use transmission gates (= TGs) to connect and disconnect this redundant tree.
Review of Existing Work
We use the 3D-MMM algorithm [10] to generate the abstract tree for the 3D clock sinks in a top-down manner. The basic idea is to recursively divide the given sink set into two subsets until each sink belongs to its own set. We then visit each sink in a bottom-up fashion and merge sub-trees until all sinks are connected via a single tree. At each recursive partitioning step, we divide the given sink set into two subsets A and B. The following two cases are considered based on the TSV bound of the current sink set: (1) If the TSV bound is one, the current sink set is partitioned such that the sinks on the same die belong to the same subset. The connection between A and B needs one TSV. (2) If the TSV bound is greater than one, the current sink set is flattened to 2D (the z-dimension is ignored) and partitioned geometrically by a horizontal or vertical line. Since each subset contains sinks from both dies, we potentially need many TSVs to connect them.
At the end of partitioning, we decide the TSV bound for each subset as follows: (1) estimate the number of TSVs required by each set, and (2) divide the given bound according to the ratio of estimated TSVs. The cut direction is determined such that the TSV bound is balanced in both subsets.
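The two partitioning cases and the division of the TSV bound can be sketched as follows. This is a minimal Python illustration for two dies, not the 3D-MMM implementation of [10]: the `Sink` record, the TSV estimate in `estimate_tsvs` (taken here as the smaller of the two per-die sink counts), the median cut, and the rounding rule in `split_bound` are all assumptions made for the example. The sketch only shows how the cases are chosen; it does not guarantee that the final tree respects the bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sink:
    x: float
    y: float
    die: int


def estimate_tsvs(sinks):
    # Assumed estimate: a mixed set needs at most one TSV per sink on its smaller die.
    on_die0 = sum(1 for s in sinks if s.die == 0)
    return min(on_die0, len(sinks) - on_die0)


def split_bound(bound, est_a, est_b):
    # Divide the TSV bound according to the ratio of estimated TSVs.
    total = est_a + est_b
    if total == 0:
        return 0, 0
    bound_a = round(bound * est_a / total)
    # Assumed guard: a half that needs TSVs keeps at least one of them.
    if est_a > 0:
        bound_a = max(bound_a, 1)
    if est_b > 0:
        bound_a = min(bound_a, bound - 1)
    return bound_a, bound - bound_a


def partition(sinks, bound):
    """Return the abstract tree as nested 2-tuples with Sink leaves."""
    if len(sinks) == 1:
        return sinks[0]
    dies = {s.die for s in sinks}
    if len(dies) > 1 and bound <= 1:
        # Case 1: keep same-die sinks together; joining A and B takes one TSV.
        first = min(dies)
        a = [s for s in sinks if s.die == first]
        b = [s for s in sinks if s.die != first]
        return partition(a, 0), partition(b, 0)
    # Case 2 (also used for single-die sets): ignore z and cut at the median,
    # choosing the direction that balances the TSV bound of the two halves.
    best = None
    for key in (lambda s: s.x, lambda s: s.y):
        ordered = sorted(sinks, key=key)
        mid = len(ordered) // 2
        a, b = ordered[:mid], ordered[mid:]
        bound_a, bound_b = split_bound(bound, estimate_tsvs(a), estimate_tsvs(b))
        imbalance = abs(bound_a - bound_b)
        if best is None or imbalance < best[0]:
            best = (imbalance, a, b, bound_a, bound_b)
    _, a, b, bound_a, bound_b = best
    return partition(a, bound_a), partition(b, bound_b)


def leaves(tree):
    """Flatten an abstract tree back into its list of sinks."""
    if isinstance(tree, Sink):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])
```

With a bound of one, `partition` separates the dies at the root, so a single TSV joins the two halves; with a larger bound, the root cut is geometric and both halves contain sinks from both dies.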
During the embedding and buffering step, the internal nodes of the 3D abstract tree are placed, and buffers are inserted under the zero-skew constraint. The classic DME algorithm [6] is extended to generate the topology embedding for the given 3D abstract tree. A cost function that considers the capacitance of buffers, TSVs, and wires is used in buffer insertion.
TSV-Buffer Insertion
Pre-bond testability of die-0 requires a fully connected clock tree on die-0 so that the minimum-skew clock signal is delivered to all FFs on die-0 using one test probe. As mentioned earlier, if multiple TSVs are used, the 3D tree construction step gives a 3D tree, where die-0 contains a fully connected tree and die-1 contains a forest. During pre-bond testing, we separate the two dies and test them individually. In this case, the 2D tree on die-0 can be used without any additional modification. However, the clock skew of this tree may no longer be zero because the downstream capacitances of some branches on die-1 are not present after the separation. This additional skew will slow down the testing process.
Our strategy to avoid this skew degradation during the pre-bond testing of die-0 is to employ so-called "TSV-buffers". A TSV-buffer is simply a buffer inserted right before a TSV. In our testing-aware DME (= TaDME) algorithm, we add a TSV-buffer for each inter-die connection that requires a TSV and route the tree accordingly under the zero-skew goal. In this case, the TSV-buffers are inserted on die-0, where the clock source is located. Since the buffers shield off all the downstream capacitance, die separation for pre-bond testing does not change the delay at the sinks on die-0. The outcome of TaDME is a zero-skew 3D tree that contains a zero-skew 2D tree on die-0 after die separation.
A key step in our TaDME algorithm is bottom-up recursive tree merging. Given a pair of zero-skew sub-trees to be merged, our goal is to locate the merging point and connect it to the root nodes of the sub-trees so that zero skew is maintained in the merged tree. Figure 1(a) shows the traditional merging process used in the original DME algorithm, where the location of a merging point E is determined based on the parasitics of the TSV and wires, and on the downstream capacitance and internal delay of the two sub-trees. In this case, if the right branch of the overall tree is removed, i.e., the TSV, edge (E, A), and CT2, the delay from E to B will change due to the change in the downstream capacitance at point E. However, if we merge with a TSV-buffer, the delay from E to B in Figure 1(b) will not change even if we remove the right branch. This is because the TSV-buffer hides the downstream capacitance at point E.
The following notations are used in Figure 1: $r$ and $c$ denote the unit-length wire resistance and capacitance, respectively. $R_d$ is the output resistance of a buffer, $C_L$ is the input capacitance of a buffer, and $t_d$ is the intrinsic delay of a buffer. $R_{TSV}$ and $C_{TSV}$ are the resistance and capacitance of a TSV. Die-0 contains a sub-tree CT1 with the root B. It has loading capacitance $C_{LB}$, and the internal delay from B to the sinks of CT1 is $t_B$. Similar symbols are used for CT2. A clock wire of length $l$ is modeled as a π-type circuit with a resistor ($rl$) and two capacitors ($cl/2$ each). A TSV is modeled as a wire with resistance $R_{TSV}$ and two capacitances of $C_{TSV}/2$. Note that the downstream capacitance at the merging point E in Figure 1(b) is $c\,l_{EB} + C_{LB} + C_L$ both before and after the die separation for testing. Thus, TSV-buffers allow us to build a zero-skew 3D tree that contains a zero-skew 2D tree on die-0 after die separation.
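The shielding effect can be illustrated numerically. The sketch below is a simplified Python model, not the authors' code: it uses the wire, TSV, and buffer parameters quoted in the experimental section, and it assumes a hypothetical upstream driver with output resistance `r_up` feeding the merging point E. Under the Elmore model the load seen at E enters the delay through that upstream resistance, so the example compares the load at E and the resulting delay to B with and without the die-1 branch. The function names and the 500μm / 50fF values are illustrative assumptions.

```python
# Parameters quoted from the experimental setup (45 nm PTM based).
R_WIRE = 0.1        # ohm per um
C_WIRE = 0.2e-15    # farad per um
C_TSV = 15.48e-15   # farad
R_D = 122.0         # buffer output resistance, ohm
C_L = 24e-15        # buffer input capacitance, farad


def cap_at_merge(l_eb, c_lb, branch_cap):
    """Downstream capacitance at E: wire (E, B), sub-tree CT1, plus the other branch."""
    return C_WIRE * l_eb + c_lb + branch_cap


def delay_to_b(l_eb, c_lb, branch_cap, r_up=R_D):
    """Elmore delay from an assumed upstream driver (resistance r_up) through E to B."""
    upstream = r_up * cap_at_merge(l_eb, c_lb, branch_cap)
    wire = R_WIRE * l_eb * (C_WIRE * l_eb / 2 + c_lb)  # pi-model of wire (E, B)
    return upstream + wire


def unbuffered_branch(l_ea, c_la):
    """Load of the right branch without a TSV-buffer: TSV, wire (E, A), and CT2."""
    return C_TSV + C_WIRE * l_ea + c_la


# Traditional merge: removing the die-1 branch changes the load at E.
post_plain = delay_to_b(500.0, 50e-15, unbuffered_branch(500.0, 50e-15))
pre_plain = delay_to_b(500.0, 50e-15, 0.0)

# Merge with a TSV-buffer: E sees only the buffer input, bonded or not.
post_buffered = delay_to_b(500.0, 50e-15, C_L)
pre_buffered = delay_to_b(500.0, 50e-15, C_L)
```

In this model `post_plain` and `pre_plain` differ, while `post_buffered` and `pre_buffered` are identical, which mirrors the argument above: the load at E stays at $c\,l_{EB} + C_{LB} + C_L$ regardless of whether die-1 is attached.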
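During the bottom-up merging process, we require that the delay from E to the sinks of CT1 through B (= $d_{E,CT1}$) be equal to that from E to the sinks of CT2 through A (= $d_{E,CT2}$), with the two edges together spanning the merging distance:
$$d_{E,CT1} = d_{E,CT2}, \qquad l_{EA} + l_{EB} = L$$
Here $t_A$ is the internal delay from A to the sinks of CT2, and $C_{LA}$ is the downstream capacitance of node A; both enter $d_{E,CT2}$ in the same way that $t_B$ and $C_{LB}$ enter $d_{E,CT1}$. $L$ is the merging distance between A and B. The location of the merging point, i.e., $l_{EA}$ and $l_{EB}$, can be determined by solving these equations.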
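As an illustration of how the merging point could be computed, the following Python sketch balances the two Elmore delays by bisection. The delay expressions are a reconstruction under stated assumptions rather than the paper's own formulas: the TSV-buffer is assumed to sit at E, followed directly by the TSV and then by a wire of length `l_ea` on die-1; wires and the TSV use the π-model described above. Function names and the example values are hypothetical.

```python
R_WIRE, C_WIRE = 0.1, 0.2e-15      # ohm/um, F/um
R_TSV, C_TSV = 0.035, 15.48e-15    # ohm, F
R_D, T_D = 122.0, 17e-12           # buffer output resistance (ohm), intrinsic delay (s)


def delay_ct1(l_eb, c_lb, t_b):
    """E -> wire (E, B) -> sinks of CT1."""
    return R_WIRE * l_eb * (C_WIRE * l_eb / 2 + c_lb) + t_b


def delay_ct2(l_ea, c_la, t_a):
    """E -> TSV-buffer -> TSV -> wire of length l_ea -> sinks of CT2 (assumed order)."""
    wire_cap = C_WIRE * l_ea
    buffer = T_D + R_D * (C_TSV + wire_cap + c_la)
    tsv = R_TSV * (C_TSV / 2 + wire_cap + c_la)
    wire = R_WIRE * l_ea * (wire_cap / 2 + c_la)
    return buffer + tsv + wire + t_a


def merge_point(L, c_lb, t_b, c_la, t_a, iters=200):
    """Return (l_eb, l_ea) with l_eb + l_ea = L and equal delays, or None."""
    def gap(l_eb):
        return delay_ct1(l_eb, c_lb, t_b) - delay_ct2(L - l_eb, c_la, t_a)

    if gap(0.0) > 0 or gap(L) < 0:
        return None  # no balance point on the segment; wire snaking would be needed
    lo, hi = 0.0, L
    for _ in range(iters):
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid   # CT1 side still faster: move E away from B
        else:
            hi = mid
    l_eb = (lo + hi) / 2
    return l_eb, L - l_eb
```

Bisection works here because lengthening (E, B) increases the delay to CT1 and, by shortening the die-1 wire, decreases the delay to CT2, so the gap is monotonic in `l_eb`. When the buffered branch is slower than the other branch even with the whole distance assigned to (E, B), the function returns `None`, which corresponds to the case where extra wire or an extra buffer is needed.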
Redundant Tree Insertion
Pre-bond testability of die-1 requires a fully connected clock tree so that the minimum-skew clock signal is delivered to all the FFs on die-1 using one test probe. As mentioned earlier, when multiple TSVs are used for wirelength reduction, the 3D tree construction step generates a 3D tree, where die-1 contains a forest. Therefore, our goal is to combine these sub-trees on die-1 into a single fully connected clock tree so that the clock skew is zero and the overall wirelength is minimized. We accomplish this by adding an additional tree, so called "redundant tree", that connects the roots of the sub-trees while maintaining zero skew. We use this fully connected tree during the pre-bond testing of die-1. Note that the redundant tree is not to be used during the post-bond testing and operations. Our strategy is to use TGs (= transmission gates) to connect and disconnect the redundant tree.
The redundant tree routing is done using a conventional approach. Given the roots of the sub-trees on die-1, we construct a binary abstract tree in a top-down fashion. We then insert a TG at each root node. Next, we embed and buffer the abstract tree using the classical DME algorithm [6] under the zero-skew and minimal wirelength goals. Lastly, we connect the enable input of the TGs using an extra control wire. In order to minimize the routing resource overhead, we minimize the total wirelength of this control signal. We use the RMST-pack [11] for this purpose. Section 4.3 provides results on how significant this overhead (= redundant tree and TG control signal) is.
Putting It Together
Upon the completion of our algorithm, we obtain a fully connected zero-skew 2D clock tree for each of die-0 and die-1, as well as a fully connected zero-skew 3D tree for the entire stack. In the case of die-1, we turn on the TGs to connect the redundant tree to the sub-trees on die-1. These two zero-skew 2D trees are used during pre-bond testing. Once the pre-bond testing is completed, we turn off the TGs to disconnect the redundant tree from die-1. At this point, the original zero-skew 3D tree is used for post-bond testing and normal operations. We show in our experimental section that our approach, which relies on multiple TSVs, TSV-buffers, TGs, and the control signal, consumes significantly less power than a simple solution in which a single TSV connects two separate zero-skew trees on die-0 and die-1.
Multiple-Die Extension
Our pre-bond testable clock tree algorithm for a two-die stack can be easily extended to handle more than two dies. Our basic 3D tree construction algorithm presented in Section 2.2 generates a 3D tree in which the die that contains the clock source (= die-0 in this case) has a single fully connected tree, while all the other dies have a forest. The basic approach remains the same: during the bottom-up merging process, we insert a TSV-buffer at each TSV location on die-0 only. If a TSV does not connect to die-0, no TSV-buffer is required. Note that the TSVs connecting non-adjacent dies, e.g., die-0 and die-2, are assumed to be stacked on top of each other. In this case, we just need a single TSV-buffer on die-0. Once the TSV-buffer insertion and embedding/buffering are completed, we add redundant trees for all the other dies that contain a forest. We add transmission gates at the roots of all the sub-trees, and provide a global control signal to connect all the transmission gates on each die. The outcome of the whole process is: (1) a single zero-skew 3D clock tree for post-bond testing and normal operations, and (2) a zero-skew 2D clock tree for each die to enable pre-bond testing.
SLEW-AWARE BUFFERING

Wirelength Balancing with Clock Buffers
Our 3D clock tree algorithm inserts two kinds of buffers: clock buffers and TSV-buffers. Clock buffers, as discussed in Section 2.2, are mainly used to control delay and skew. These clock buffers are usually inserted closer to the clock source and drive large loads to reduce the delay along the clock paths. The TSV-buffers, as discussed in Section 2.3, are inserted at every TSV location on the clock source die to make sure that the clock tree on the clock source die maintains zero skew during pre-bond testing. We observe, however, that TSV-buffers often unbalance the wirelength during the bottom-up merging process. The reason is that if two sub-trees, CT1 on die-0 and CT2 on die-1, are merged, we are forced to add a TSV-buffer on die-0. As shown in Figure 1(b), TSV-buffer insertion increases the delay from E to CT2. Depending on the internal delays and downstream capacitances of the two sub-trees, if $t_A > t_B$, the TSV-buffer enlarges the imbalance of internal delays, causing the edge (E, B) to become longer. If the delay difference is very large, we may even require wire snaking to balance the delay for zero skew. This wirelength imbalance may cause the overall clock wirelength overhead on die-0 to become non-negligible.
Our strategy to tackle this problem is to add extra clock buffers that balance the internal delays and thus the lengths of the related wires. Specifically, if the wirelength imbalance caused by TSV-buffer insertion is significant, we insert an extra clock buffer on the other branch to balance the internal delays. In Figure 1(b), we would add an extra clock buffer along the edge (E, B). We observe that this delay balancing with extra clock buffers eventually reduces the overall wirelength on die-0. We also observe that the number of clock buffers used for wirelength balancing is usually low because the wirelength imbalance does not occur frequently.
Slew Rate Control with Clock Buffers
Clock slew rate control is an important reliability issue for high-speed clocking. If the slew rate is too low, i.e., if the clock signal takes too long to rise to 1 or fall to 0, the FF setup and hold times are affected, which eventually slows down the clock. Existing work on slew-aware clock tree synthesis relies on buffer insertion [12, 4, 5, 7]. Buffers are added along the clock paths so that the output load of each buffer is limited to a certain upper bound. This bound, denoted CMAX in the literature, has been shown to be effective in improving the slew rate: a smaller CMAX value improves the slew rate at the cost of more inserted buffers. Most existing works insert buffers into a given clock tree as a post-process to improve the slew rate under various constraints, including buffer area and clock power. A limitation of this post-synthesis slew-aware buffering is that buffer insertion must be done carefully so as not to increase the clock skew. This may constrain the location of buffers.
Our strategy to tackle the slew rate issue is to add buffers under the CMAX constraint during clock tree synthesis. Specifically, we insert clock buffers, together with TSV-buffers, during the bottom-up merging process so that the CMAX constraint is satisfied for both types of buffers. We add clock buffers along the paths from the merging point to the sub-tree root nodes if the downstream capacitance at the merging point exceeds CMAX. Depending on the load, we may insert more than one clock buffer to meet the CMAX requirement. Figure 2 shows several possible scenarios for clock/TSV-buffer insertion. In summary, our clock tree synthesis algorithm uses three criteria to insert buffers during the bottom-up merging process: (1) we add a TSV-buffer for every TSV connecting to the clock source die (for pre-bond testability), (2) we add a clock buffer if the wirelengths between the merging point and the two sub-tree roots are not balanced (for wirelength reduction), and (3) we add clock buffers if the downstream capacitance of any buffer exceeds the given upper bound, namely CMAX (for slew rate control).
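The CMAX criterion can be illustrated with a small Python sketch that places buffers along a single branch. It is a simplified stand-in for the buffering performed during merging, not the authors' procedure: the greedy walk from the sub-tree root toward the merging point and the `place_buffers` name are assumptions, and the sketch ignores the zero-skew rebalancing that the actual merging step performs at the same time. The wire capacitance, buffer input capacitance, and CMAX values are those quoted in the experimental section.

```python
C_WIRE = 0.2e-15   # wire capacitance, F per um
C_L = 24e-15       # buffer input capacitance, F
C_MAX = 300e-15    # maximum load per buffer, F


def place_buffers(length_um, c_root, c_max=C_MAX):
    """Walk from a sub-tree root toward the merging point and insert a buffer
    whenever the accumulated load would otherwise exceed c_max.
    Returns buffer positions as distances (um) from the sub-tree root."""
    if c_root > c_max or C_L >= c_max:
        raise ValueError("load cannot be driven within c_max")
    positions, pos, load = [], 0.0, c_root
    while True:
        reach = (c_max - load) / C_WIRE   # wire length that uses up the remaining budget
        if pos + reach >= length_um:
            return positions
        pos += reach
        positions.append(pos)
        load = C_L                        # upstream of a buffer, only its input is seen
```

For example, a 3000μm branch driving a 100fF sub-tree needs two buffers under this rule (about 1000μm and 2380μm from the root), whereas a 500μm branch with the same load needs none. A tighter `c_max` shortens each reach and therefore increases the buffer count, which is the trade-off described above.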
EXPERIMENTAL RESULTS
We implemented our algorithm using C++/STL on Linux. We use five benchmarks from the IBM suite [1] and four from the ISPD clock network synthesis contest suite [2]. Since these designs are for 2D ICs, we obtain 3D designs by randomly partitioning the clock sinks across multiple dies and scaling the footprint area by √2 and √4 for two-die and four-die stacks, respectively. We use technology parameters based on the 45nm PTM [3]: the unit-length wire resistance is 0.1Ω/μm, and the unit-length wire capacitance is 0.2fF/μm. The sink capacitance values range from 5fF to 80fF. The buffer parameters are: $R_d$ = 122Ω, $C_L$ = 24fF, and $t_d$ = 17ps. We use 10μm × 10μm via-last TSVs with a thinned die height of 20μm. The TSV parasitics are: $R_{TSV}$ = 0.035Ω and $C_{TSV}$ = 15.48fF. The clock frequency is set to 1GHz with a supply voltage of 1.2V. Clock skew is constrained to 3% of the clock period. The clock slew constraint is set to 10% of the clock period. Correspondingly, the maximum load capacitance of each buffer (= CMAX) is 300fF for slew rate control. Our pre-bond testable clock routing algorithm generates zero-skew clock trees under the Elmore delay model. We report all the clock-related metrics, such as delay, skew, slew, and power consumption, based on SPICE simulation.
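For reference, the percentage constraints translate into absolute limits as in the short Python sketch below. The constant and function names are illustrative assumptions; only the 1GHz frequency and the 3% and 10% ratios come from the setup above.

```python
CLOCK_HZ = 1e9
PERIOD_PS = 1e12 / CLOCK_HZ         # 1000 ps at 1 GHz
SKEW_LIMIT_PS = 0.03 * PERIOD_PS    # 3% of the clock period
SLEW_LIMIT_PS = 0.10 * PERIOD_PS    # 10% of the clock period


def meets_constraints(skew_ps, slew_ps):
    """True if a measured skew and slew (both in ps) satisfy the limits."""
    return skew_ps <= SKEW_LIMIT_PS and slew_ps <= SLEW_LIMIT_PS
```

The limits come out as 30ps for skew and 100ps for slew, so the values reported for benchmark r5 in this section (29.1ps skew and 88.4ps maximum slew) pass the check.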
TSV-buffer and TG Model Validation
In pre-bond testable clock routing, we utilize two new circuit elements, namely TSV-buffers and transmission gates (= TGs), to facilitate pre-bond testing and post-bond testing/operations. TSV-buffers are used to shield die-0 (= the die that contains the clock source) from the downstream capacitance on the other die, which helps maintain zero skew during the pre-bond testing of die-0. TGs are inserted to support the pre-bond testing of the other dies, which contain disconnected sub-trees. Figure 3 shows the equivalent circuit models used for the SPICE validation of TSV-buffers and TGs. We simulate a post-bond 3D clock tree for a two-die stack along with the pre-bond testable 2D clock trees on die-0 and die-1. Node A is the clock source for post-bond operation. Sink C on die-0 and sink E on die-1 have loading capacitances of CLC and CLE, respectively. Nodes B and D are connected by a TSV-buffer and its TSV. The edge (D, E) is a sub-tree on die-1 and is connected to F, its pre-bond testing clock source, via a TG. CLC and CLE are set to 5fF. Wires (A, B), (B, C), (D, E), and (F, D) all have a length of 500μm. First, we observe from SPICE simulation that the delay from A to C in Figure 3(a) is 42.21ps, which is the same as that from A to C in Figure 3(b). This verifies that our TSV-buffer maintains the delay to the sinks on die-0 after separating die-1 for pre-bond testing. Second, the TG has 14.2fF capacitance between node D and the ground when it is off. In this state, the TG completely blocks the clock signal from A to F. When the TG is on for the pre-bond testing of die-1, however, it has 108Ω between its input and output nodes, 16.4fF between its input and the ground, and 18.4fF between its output and the ground. The intrinsic delay of a TG is 1.04ps. Under this model, the calculated delay from F to E is 54.13ps, which closely matches the simulated delay of 54.14ps.
Sample Trees and Waveforms
Figure 4 shows a series of pre-bond testable clock trees for the circuit r1 from the IBM suite, based on a TSV upper bound of 10. Figure 4(a) is the zero-skew 3D clock tree for post-bond testing and normal operations. The solid and dotted lines represent the clock trees on die-0 and die-1, respectively. The tree contains 10 TSVs, denoted by the black dots. The triangle is the clock source. Note that die-1 contains many sub-trees (= dotted lines) that are not connected. Figure 4(b) shows the zero-skew pre-bond testable 2D clock tree for die-0, which is identical to the solid clock tree in Figure 4(a). Figure 4(c) shows the zero-skew pre-bond testable 2D clock tree for die-1, which contains a redundant tree that connects the root nodes of the sub-trees on die-1 (= dotted lines). Figure 5 shows two groups of clock waveforms for benchmark r5, where each group contains 25 waveforms for 25 sinks in each tree. The first group (shown on top) is from the post-bond 3D clock tree, whereas the second group (shown on bottom) is from the pre-bond testable 2D clock tree for die-0. We first observe that the waveforms among the 25 sinks are almost identical, which is desirable. In addition, the two groups have similar waveforms, which demonstrates that the TSV-buffer helps maintain the waveforms between post-bond operation and pre-bond testing. Second, the SPICE simulation shows that the clock skew among all sinks in both cases is 29.1ps, which can also be observed from the width of the waveforms at 50% Vdd. Third, the maximum slew is 88.4ps, measured as the rise time from 10% to 90% of Vdd (or the fall time from 90% to 10%). Both the skew and slew values satisfy our constraints (3% and 10% of the clock period, respectively).
Wirelength, Skew, and Power Results
Table 1 shows the wirelength (μm), power consumption (mW), and skew (ps) results for the post-bond 3D clock tree (= denoted T3D) and the pre-bond testable 2D clock trees for die-0 (= denoted T0) and die-1 (= denoted T1). For die-1, we report the total wirelength (= WL), the wirelength of the sub-trees (= WL-sub), the redundant tree (= WL-red), and the TG control signal (= WL-TG). In this case, the WL of T1 is equal to the sum of WL-sub and WL-red. In addition, the WL of T3D is the sum of the WL of T0 and the WL-sub of T1.
Based on the wirelength-related columns, we see that (1) the total wirelengths (= WL) of T0 and T1 are comparable, (2) in several cases, the wirelength of the redundant tree (= WL-red) is about 2x the total wirelength of the sub-trees (= WL-sub) on die-1, and (3) in several cases, the wirelength of the TG control signal (= WL-TG) is about half that of the redundant tree on die-1 (= WL-red). Note that the redundant tree and the TG control signal are used only during the pre-bond testing of die-1. This non-negligible overhead is compensated by the significant power saving discussed in Section 4.4. The skew values do not exceed 30ps, which satisfies our skew constraint of 3% of the clock period. In terms of clock power consumption, die-0 consumes more power than die-1, primarily due to the TSV-buffers added on die-0.
Comparison with Single-TSV Approach
Our baseline 3D clock tree contains a single fully connected zero-skew clock tree on each die, and these trees are connected with a single TSV for a two-die stack and a single "stacked TSV" for stacks with more than two dies. Table 2 shows the comparisons of wirelength, clock power, and skew results based on SPICE simulation. The runtime for each circuit is less than one second in all cases.
First, our multi-TSV approach significantly outperforms the single-TSV approach in terms of wirelength: the reduction is 14.8%-24.4% for two-die stacks and 39.2%-42.0% for four-die stacks. Similarly, the power saving for the clock tree is 10.1%-15.9% for two-die stacks and 18.2%-29.7% for four-die stacks. These results demonstrate the benefits of our multi-TSV approach. Second, the #Bufs columns show the total number of buffers used in the trees, including both the clock buffers and TSV-buffers. We see that the total number of buffers used is comparable between the single-TSV and the multi-TSV approaches. In the case of the single-TSV approach, buffers are inserted mainly to control wirelength and slew in each die. Our multi-TSV approach shows different trends in buffer usage between the two-die and four-die cases: most of the buffers are clock buffers used for wirelength and slew control in the two-die stack; in contrast, most of the buffers are TSV-buffers in the four-die cases. Since the TSV-buffers also have a positive impact on wirelength and slew in the four-die cases, we do not need many clock buffers. Third, the clock skew values are all below the constraint of 30ps in both cases.
Wirelength and Power Results
Figure 6 shows the impact of the TSV bound on wirelength, buffer count, and clock power consumption. These metrics are normalized to the baseline results, which are based on the single-TSV approach. The x-axis corresponds to the TSV bound used to build our multi-TSV pre-bond testable 3D clock tree. Note that the actual TSV usage may differ from this bound because the clock tree synthesis algorithm itself determines the number of TSVs to use for wirelength minimization. When the TSV bound is set to infinity, the actual TSV usage is 3097 for benchmark r5.
We first observe that the wirelength consistently decreases as more and more TSVs are used in our 3D pre-bond testable clock tree. The wirelength saving reaches 45% if the TSV bound is set to infinity. This shows that TSVs in general help reduce the overall wirelength of the 3D clock tree. Second, the total number of buffers used (counting both the clock buffers and TSV-buffers) increases as more and more TSVs are used. This is mainly due to the TSV-buffers inserted for pre-bond testability. Taking both the wirelength and buffer trends into consideration, the power consumption decreases, consistently but slowly, as more and more TSVs are used. The maximum power saving for r5 is around 18% compared with the single-TSV case, when the 3D clock tree uses around 2500 TSVs across all four dies. If more than 2500 TSVs are used, the power consumption goes back up, mainly due to the excessive TSV-buffer insertion. This trend allows us to choose the right TSV bound for a given power budget. For example, if a power saving of 10% is required, the TSV bound is set to 300.
Impact of CMAX on Power and Slew
Table 3 shows the impact of CMAX (= the maximum output load each buffer can drive) on skew, maximum rise slew, and maximum fall slew among all sinks on all dies. We use the four-die stack of benchmark r1 and compare the single-TSV approach with our multi-TSV approach. We observe that as the CMAX value increases, the maximum rise and fall slews for both the single-TSV and multi-TSV cases increase. In other words, a tighter CMAX means better slew.
All of the slew values are below the constraint of 10% of the clock period, which is 100ps. The slew values are slightly smaller in the multi-TSV case, mainly due to the (slightly) larger number of buffers inserted for slew control. In terms of skew, the trend is not obvious for the single-TSV case. However, skew tends to decrease with a tighter CMAX value for the multi-TSV case. The main reason is that the wirelength is shorter in these cases, which causes the clock buffers added for slew control to have a positive impact on delay and skew as well. Figure 7 shows the impact of CMAX on clock power consumption. We use the four-die stack of r1 for this experiment. The overall trend is the same in both the single-TSV and multi-TSV cases: a tighter CMAX results in more power consumption. This is because more clock buffers are inserted to meet the tight CMAX constraint. However, the power benefit of the multi-TSV case over the single-TSV case remains consistent regardless of the CMAX value.
CONCLUSIONS
In this paper, we studied how to construct a clock tree for 3D stacked ICs so that each individual die can be tested before bonding. Our solution utilizes multiple TSVs for wirelength and clock power saving, which in turn necessitates new circuit elements, namely TSV-buffers, redundant trees, and transmission gates, to support low-skew and low-power testing and operations. We also studied the impact of buffer insertion on slew rate in 3D stacked IC clocking. SPICE results show that our method based on multiple TSVs significantly reduces the wirelength and power of the 3D clock tree relative to a baseline approach that uses only one TSV.
