ABSTRACT
INTRODUCTION
Compared to the conventional two dimensional (2D) integration with one active device layer, the three dimensional (3D) integration with multiple active layers is effective to increase integration level and improve performance [1] [2] [3] . However, 3D integration also creates such challenges as power and thermal integrity, more difficult to deal with than those in 2D integration. Fig. 1 illustrates a typical 3D stacking of multiple active layers inside a single package. Power is supplied from the bottom, the power and ground planes in the package. C4 bumps connect the power and ground planes to the active device layers, and through-vias that could be called as "power/ground vias" are used to carry power supply between active layers. Due to the strong electromagnetic coupling between the package and the power delivery system (PDS) [4] , it is a critical to optimize power integrity in 3D ICs [5] . As shown by the PDS designs in 2D ICs [6] [7] [8] , stapling power/ground vias reduces the loop inductance of power/ground planes, and hence reduces the SSN (simultaneously switching noise) in package. However, there is no in-depth study on design automation of PDS in 3D ICs.
Moreover, because of the increased power density, the heat dissipation is extremely important in 3D ICs [1] . It is well known that excessively high temperature can significantly degrade interconnect/device reliability and performance [9] [10] [11] . Since vias are good thermal conductors as well, adding through-vias as "thermal vias" between device layers is effective to remove heat [12, 13] . A heat-sink is needed when the chip power is beyond 25Watt. It is usually placed on the top of device layers and serves as the primary heat-removal path. As shown in Fig. 1 , thermal vias can be inserted to more effectively remove heat from bottom device layers to the heat sink.
We call power/ground and thermal vias as non-signal vias. The number of these vias could be very large. For example, more than 10 4 thermal vias were used in [12, 13] . Because large numbers of non-signal vias introduce congestion for signal vias, planning non-signal vias in 3D ICs becomes a need. Existing via-stapling [6-8, 12, 13] implicitly optimizes power and thermal integrity separately, where power/ground vias are inserted to satisfy power integrity constraints, and thermal vias are inserted to satisfy thermal integrity constraints. These vias are stapled according to the distribution of maximum temperature and voltage violations. Because maximum voltage violations are often located differently from maximum temperature violations, the resulting vias can have quite different patterns. Fig. 2 shows the spatial distribution of the normalized temperature and voltage violations for a typical 3D design before stapling vias. Since the combined power/ground planes work as a cavity resonator, large voltage violation can be found often in the center of the planes [4, 6] . On the other hand, the thermal hotspots are those regions close to the heating sources on the device layer and they may spread more uniformly for a thermal aware 3D placement and routing [12, 13] . As a result, it leads to two different via-stapling patterns: the vias tend to be stapled in the center for the power integrity but stapled uniformly across the plane for the thermal integrity. Because stapling vias in such a sequential fashion ignores that non-signal vias could be used to minimize both the temperature and voltage violation, it may result in an over-design. Furthermore, to obtain a valid solution, the existing thermal-via planning [12, 13] assumes a steady-state thermal analysis with the maximum thermal-power as inputs. Since it is rare if not impossible for different regions to simultaneously reach their maximum thermal-power, the assumption of steady-state analysis may also lead to excessive numbers of vias. Therefore, it urges us to provide a stapling method using transient analysis to find the minimum number of non-signal vias, such that both power integrity in P/G planes and thermal integrity at device layers can be satisfied.
In this paper, we formulate and solve the 3D via-stapling problem to minimize non-signal vias subject to power and thermal integrity simultaneously. We apply transient models for temperature and supply voltage noise. As shown by experiments, for a sequential power and thermal optimization, using transient analysis reduces stapled non-signal vias by an average of 11.5% compared to using steady-state analysis. Moreover, our simultaneous optimization of power and thermal integrity by transient analysis reduces on average 34% non-signal vias compared to the sequential optimization by steady-state analysis.
The rest of the paper is organized as follows. We discuss the model and formulate the level based via-stapling problem in Section 2. We introduce a fast integrity analysis in Section 3, and develop an efficient algorithm in Section 4. We present experiments in Section 5 and conclude the paper in Section 6.
MODELING AND PROBLEM FORMULATION

Distributed Circuit Model
We model the 3D design by two parts: distributed thermal-RC circuit for thermal integrity and electrical-RLC circuit for power integrity. There is a well-known duality between electrical and thermal systems (See Table 1 ). As temperature is analogous to voltage, the heat flow can be modeled by a current passing through a pair of thermal resistance and capacitance driven by the current source, which in turn models the power dissipation.
The 3D layout of each silicon device layer and power/ground plane can be uniformly discretized by the finite difference method. As shown in Fig. 3 , each discretized tile can be represented by an RC-cell and an RLC-cell to construct distributed thermal-RC and electrical-RLC circuits, respectively. The via or C4 bump is modeled by a lumped RC pair for both thermal and electrical circuits. Moreover, we use partial inductance [14] to consider the magnetic coupling in the electrical-RLC circuit.
Thermal Analysis
According to [15] , a transient thermal-power is the running average of the cycle-accurate (often in the range of ns) power over the thermal time constant (often in the range of ms). When working loads are known, a constant maximum thermal-power can be calculated as the maximum of the transient thermal-power, and should be calculated for each region of the chip. Fig. 4 illustrates differences of these power definitions.
Assuming steady-state thermal analysis (based on thermal resistance model), thermal-via allocation has been studied for the placement [12] and routing [13] . Because the steady-state analysis ignores the temporal and spatial variations of the transient thermal-power, the methods in [12, 13] have to assume the maximum thermal-power simultaneously for all tiles in the integrated circuit to obtain a solution without thermal violation. Because it is rare if not impossible for different tiles to simultaneously reach their maximum thermal-power, the methods in [12, 13] may lead to excessive number of thermal vias. In addition, they directly solve the matrix-formed state equation. It can not efficiently calculate the nominal temperature and its sensitivity with respect to the via density for large scale designs. The design procedure is either based on iterations [12] , or based on an approximated square-root relation [13] between temperature and thermal-vias. It may not converge or may lead to inaccurate results. Therefore, accurate and efficient solutions to transient temperature and temperature sensitivity should be developed and are discussed in Section 3.
Level based Via-Stapling Problem
During the early planning stage, vias are vertically stapled between each pair of aligned tiles in adjacent active device layers. The stapling may have different patterns, which can be described by levels: see, the via-pattern becomes more uniform as the level increases. Multiple levels can be used simultaneously for a design, and we assume that the maximum level number K is provided by users. Considering total K levels of via-stapling patterns, our design freedom is the via-density (number of vias in one tile) for each level. If we define a via-density vector
each level is then associated with a via density D i to be decided during the via-stapling. Accordingly we have the following problem formulation:
. Given the allowed maximum voltage violation Vmax in P/G planes and the allowed maximum temperature Tmax in device layers, the via-stapling problem is to minimize the total via number, such that the temperature is smaller than Tmax and the voltage violation is smaller than Vmax.
This simultaneous power and thermal integrity driven problem can be represented by
where n i is the number of tiles to be stapled in level i. Dmax is decided by the signal-via routing congestion and D min is decided by the current density (reliability). The key to solving (2) is an efficient yet accurate transient integrity analysis. Such an analysis is presented in Section 3.
INTEGRITY ANALYSIS 3.1 Parameterized Description
Because the distributed thermal-RC circuit is a simpler case for the electrical-RLC circuit, we only present the RLC circuit for modeling and algorithm description unless stated otherwise.
Note that the via density D i at one tile is related to the via area A i in the tile by
. a is the unit area of via determined by the processing technology. Because conductance and capacitance values are all proportional to the area A, they are implicitly proportional to the via density D as well. Therefore, the electrical-RLC circuit with the parameterized number of vias can be formulated in MNA (modified nodal analysis) in frequency (s) domain: Vias are stapled between each pair of adjacent layers.
Parameterized via-density 
All notations in (4) are summarized in Table 2 . Note that B is the adjacent matrix composed by E i . It describes p i inputs and po critical nodes to be probed, both provided by users.
To mathematically describe adding vias in a circuit equation, an insertion (adjacent) matrix X is introduced. For a level-i stapling, adding vias between tiles m and n results in:
where
Then, the unit conductance and capacitance matrices for the level-i stapling are:
where g and c are conductance and capacitance for the via with unit area a. The vias are then added to the nominal G 0 and C 0 as a perturbed adjustment:
where G and C are the adjusted state matrices including the perturbation from vias.
Note that there are the following differences between thermal-RC and electrical-RLC circuits in MNA. For a thermal-RC circuit, (4) becomes G 0 = G, C 0 = C (6) where G and C has larger RC values and results in a larger timeconstant than an electrical-RLC circuit does. In addition, the input I(s) for the thermal-RC circuit stands for thermal-power. In contrast, I(s) stands for switching-current for electrical-RLC circuits. Moreover, output y at the selected critical nodes is temperature T and voltage V for the thermal-RC and electrical-RLC circuit, respectively.
Macromodel by Moment Matching
It is inefficient to directly solve (3) for large scale designs. Macromodeling technique based on moment matching can obtain compact models for large distributed thermal-RC and electrical-RLC circuits. We first review the existing flat reduction in this subsection, and then introduce a structured and parameterized reduction in Section 3.3.
To build macromodel by moment matching, we first define moment generation matrices (expanded at s 0 ) as
We then obtain the following projection matrix Q that spans the qth block Krylov subspace by
which can be constructed by a block Arnoldi method [16] . As a result, a reduced system can be obtained by projection:
Note thatĤ accurately approximate the original H by matching the first q block moments expanded at s 0 [17] . This procedure can be applied for both thermal-RC and electrical-RLC circuits. The above macromodeling, however, can only generate nominal values. By expanding (3) at design parameter points such as the via density at different levels, parametrized moments, i.e., the sensitivities of temperature and voltage with respect to the via density can be obtained. Because the parameterized moments have coupled frequency and parameter variables [18] , their dimensions grow exponentially and they may not be used in practice. This is improved in [19] by separately expanding moments of design parameters from those of frequency. However, [18, 19] apply a flat projection during the reduction and destroy the matrix structure. As a result, the nominal values and sensitivities after reduction can not be separated easily.
Structured and Parameterized Reduction
In this paper, we show that nominal values and their sensitivities can be obtained separately by a structured and parameterized reduction. Because the sensitivity is large with respect to the frequency change but small with respect to the design parameter perturbation, the state variable x(D, s) can be approximated by Taylor expansion:
This is similar to the method in [19] to deal with process variations. Substitute (7) in (3), and explicitly match the moment for each D i to the first-order, (3) can be reformulated into an augmented parameterized state equation:
with Gap = 2 6 6 6 4
and
Note that Cap has the same lower-triangular structure as Gap does. The state variable yap at output for those critical tiles can be also divided into two parts:
1 , ..., y
As a result, solving (8) results in the nominal value, y (0) , and its first-order sensitivity y (1) with respect to each parameter D i . The large system equation (8) can be reduced using projection with preserved moments (of s) up to the qth-order. Because Gap and Cap have lower-triangular structures, a flat projection matrix Qap can be constructed recursively using an iterative Arnoldi method [19] . However, [19] directly projects (8) by Qap. It leads to a reduced macromodel losing the block structure for both state matrices and variables. As a result, y (0) and y (1) are interleaved with each other.
In this paper, instead of using the flat projection matrix Qap
we introduce a structured projection matrix Qap = 2 6 6 6 4
by partitioning Qap according to the dimension of x (0) and x (1) . Note that Gap, Cap and Qap have larger dimensions than G, C and Q. However, as Gap, Cap and Qap are in block triangular or block diagonal form, they can be implemented in a block-matrix data structure [20, 21] without memory usage increase. More importantly, as shown in [20, 21] , the projection by Qap preserves the block matrix structure. As a result, the orderreduced state matrices after projection by Qap are 
Note that the reduced e Cap has the same preserved lower-triangular structure as e Gap. In addition, the structured and parameterized macromodel where
0 , e y (1) 1 , ..., e y
The overall voltage/temperature (V /T ) vector at those critical tiles perturbed by the via-density vector, i.e. D, is
Since our reduction preserves the block structure, the reduced nominal value e y (0) and first-order sensitivity e y (1) at output (critical tiles) can be solved independently. Moreover, because the reduced system still has the lower-triangular structure, it is obvious that (13) can be efficiently solved using back substitution with only one LU factorization of
As a result, such a structured and parameterized macromodel can be incorporated in a sensitivity based optimization for efficient yet accurate staple vias.
SENSITIVITY BASED OPTIMIZATION
Because our structured and parametrized macromodel provides both nominal values and sensitivities, they can be incorporated in any gradient-based optimization. However, the Hessian matrix used in gradient-based optimization [22] is computationally expensive to obtain second-order gradients. If there are K parameters in the design, it needs K 2 parameterized second-order moments to generate a Hessian matrix. As a result, the cost to build and simulate a macromodel becomes huge. This is inefficient and not necessary for the via-planning during the early-design stage.
For the sake of speed to handle large scale problems, the technique used in our paper is a sensitivity-based heuristic similar to the TILOS [23] algorithm. By successively increasing via density for the via-level with the largest gain in each iteration, our algorithm staples a minimum number of vias to reduce both temperature and voltage violations in problem formulation (2) . This greedy heuristic optimizer can solve large scale designs efficiently and effectively.
The overall optimization to solve problem formulation (2) is outlined in Algorithm 1. Note that the weighted sensitivity vector S is a weighted-sum of normalized voltage-sensitivity vector S V and thermal-sensitivity S T :
where α and β are weights for S T and S V . The via density vector D is updated by where γ is an adaptive-controlled step size and decreases as the iteration proceeds.
Algorithm 1 Via Stapling Procedure
Input: Critical nodes, via pattern number K, signal congestion induced bound Dmax and current-density induced bound D min
Step 0: Reduce (8) using structured and parameterized reduction;
Step 1: Compute nominal voltage(V )/temperature(T ) and sensitivity S V /S T using (13) by backward-substitution;
Step 2: Determine Vmax and Tmax of critical nodes;
Step 3: Increase the via density D according to weighted sensitivity S in the range of (D min , Dmax);
Step 4: Update the structured and parameterized macromodel according to (12);
Step 5: Repeat from Step 1 until maximum noise/temperature constraints are satisfied; Output: Via density vector D Because our macromodel is parameterized, only one reduction is needed and the reduced state matrices can be repeatedly used when updating the parameter vector D according to (12) . In addition, the reduced model is much smaller than the original one and has a lower-triangular structure. Its nominal value and sensitivities, therefore, can be efficiently solved by backwardsubstitution of (13) with only one LU factorization. As a result, the optimization procedure in Algorithm 1 is computationally efficient.
EXPERIMENTS
The proposed modeling and algorithm have been implemented in C. Experiments are run on a Sun-Fire-V250 workstation with 2G RAM, and the reported number of vias are all for no-signal vias. Silicon, copper and dielectric are assumed for via, heatsink, active device layer, inter-layer and PG plane, respectively. Table 3 and Table 4 summarize their electrical and thermal constants and dimensions. In addition, 20% of the deive-layer tiles have a random input of thermal-power in the range of 1 to 5 × 10 6 W/m 2 . Their clock gating pattern has a period of 100ms where the power in the standby mode is 20% of the running mode. 10% of power/ground-plane tiles have random inputs of switching-currents in the range of 1 to 5 × 10 −1 A with risingtime 0.01ns. All power sources are uniformly distributed across the active device layer or PG-plane. The range of via density is set from 100 to 80000 for each level, and the weight α and β are equal for (15) . A modest 3D stacking with 1-heat-sink, 2-device-layer, 2-dielectric-layer and 2-P/G-plane is assumed for the experiments.
Verification of Macromodel and Optimization
The first example is a 3D stacking with 7900 total tiles and a level-4 via pattern. Fig. 6 compares the time-domain waveform between our macromodel and exact MNA solution at ports 1 and 5 of one selected P/G plane. The macromodel (expanded at s 0 = 0) uses order 40. Clearly, our macromodel is accurate. Fig. 7 presents the decreasing of the transient temperature and voltage violation during the simultaneous optimization procedure. Fig. 8 further shows the steady-state temperature map across the top device layer. In this example, we assume that all thermalpower sources are located at one side of the device layer. The initial chip temperature at the top layer is around 150 • C, and its temperature profile at steady-state is shown in Fig. 8 (a) . In contrast, the via-stapling results in a cooler temperature that closely approaches the targeted temperature as shown in Fig. 8 (b).
Comparison between Steady-state and Transient Thermal Analysis
We further compare the runtime and the number of vias between the steady-state and transient thermal analysis. We increase the circuit complexity by increasing the number of discretized tiles, and need more levels for vias when the tile number becomes larger. The sequential optimization in this comparison is used. We first allocate vias to satisfy the power integrity con- straints with targeted voltage violation Vmax of 0.2V , and then allocate vias to satisfy the thermal integrity constraints with targeted temperature Tmax of 52 • C and considering the already stapled "power/ground" vias for heat removal. Table 5 presents the results. The vias are over-designed when using steady-state analysis. Compared to the optimization by steady-state analysis, the optimization by transient thermal analysis reduces vias by 11.5% for a circuit with 27740 tiles. This is because the steady-state analysis has to assume a constant maximum thermal-power input for all tiles in order to get a valid solution. In reality, it seldom happens that all tiles have their maximum input at the the same time. In contrast, our transient thermal analysis can accurately generate the transient temperature using the input of transient thermal-power.
Furthermore, directly solving state equations as in [12, 13] results in longer runtimes. In contrast, the macromodel can efficiently match the transient response using around 20 moments. For the same circuit with 27740 tiles, our macromodel has a 155X smaller runtime compared to the steady-state analysis, and the steady-state analysis can not complete the largest example.
Comparison between Sequential and Simultaneous Optimizations
We further compare the sequential optimization with the simultaneous thermal/power optimization, and first discuss via patterns for thermal and power integrity, respectively. As shown in Table 6 , for a circuit with 27740 tiles, when only using the thermal-constraint, more vias tend to be stapled for high-level patterns. As a higher level pattern means more uniform via distribution, the thermal constraint results in a more uniformly distributed via pattern. On the other hand, when only using P/Gconstraint, more vias are stapled in the center, i.e., using level-0 via pattern to reduce the power/ground plane loop inductance or SSN. Due to such an opposite stapling trend, a via-stapling in a sequential fashion results in excessive number of vias. In contrast, the vias are distributed more uniformly in all levels when simultaneously considering the thermal and power integrity. Finally, we compare the results using simultaneous optimization and sequential optimization. On average, our simultaneous optimization further reduces 34% vias compared to the the sequential optimization by steady-state analysis in Table 5 , and reduces 22.5% vias compared to the sequential optimization with transient analysis.
CONCLUSIONS
Inter-layer non-signal vias are effective to reduce power supply noise and temperature hotspots in 3D ICs. Existing work on via-stapling [6-8, 12, 13] does not consider thermal and power integrity simultaneously, and uses steady-state thermal analysis. To reduce the number of vias for targeted power and thermal integrity, this paper has presented the first in-depth study on simultaneous power and thermal integrity driven via-stapling.
Our simultaneous power and thermal integrity optimization minimizes non-signal vias subject to constraints on transient temperature and voltage violations, which are calculated by a structured and parameterized model reduction. This model reduction also generates parameterized sensitivities of temperature and voltage violations with respect to the via pattern and density. The resulting macromodel is used in the efficient greedy optimization simultaneously driven by power and thermal integrity.
Experiments with two active device layers show that for the sequential power and thermal optimization, using the transient thermal analysis reduces non-signal vias by on average 11.5% compared to using the steady-state thermal analysis. In addition, our simultaneous optimization of power and thermal integrity reduces on average 34% non-signal vias compared to the existing approach assuming the sequential optimization and steady-state thermal analysis. The via reduction could be higher for the 3D design with more device layers.
The power integrity in this paper considers noise on power/ground planes in the package without considering on-chip power supply routing. In the future, we will develop simultaneous power and signal routing to optimize on-chip power and thermal integrity. 
