Abstract-Thermal simulation has become increasingly important in chip design, especially in the nanometer regime, where the onchip hot spots severely degrade the performance and reliability of the circuit and increase the leakage power. In this paper, we present a highly efficient and accurate thermal simulation algorithm that is capable of performing full-chip temperature calculations at the cell level. The algorithm is a combination of several important numerical techniques including the Green function method, the discrete cosine transform (DCT), and the frequency domain computations. Experimental results show that our algorithm can achieve orders of magnitude speedup compared with previous Green function based algorithms while maintaining the same accuracy.
I. INTRODUCTION
Thermal simulation algorithms in chip design can be roughly divided into two categories, based on whether the meshing of the entire substrate is necessary during the simulation process. The generic thermal simulation algorithms such as the finite difference method (FDM) and the finite element method (FEM) used in [1] and [2] enjoy the advantages of high flexibility in handling different kinds of boundary conditions in thermal problems and the capability of achieving high accuracy. However, their requirement of meshing the entire substrate and later solving a large system of linear equations makes them relatively inefficient. In [3] , the thermal-ADI algorithm was proposed to efficiently solve the transient thermal problems using a meshing scheme similar to that used in the FDM. However, for steady-state analysis, the ADI algorithm can also become slow if the initial guess of the temperature distribution is far from the final solution.
The boundary element method (BEM) constitutes another class of thermal simulation algorithms in which the volume meshing of the substrate is completely avoided. An important underlying concept in the BEM is the Green function, which describes the temperature distribution in the chip when a unit point power source is present. In [4] and [5] , the analytical forms of the Green function were derived by assuming that the chip was infinitely large horizontally. One significant advantage of the analytical forms of the Green function is that they are very inexpensive to evaluate and hence can be easily incorporated into optimization procedures where the Green function needs to be evaluated many times. However, by assuming that the chip is infinitely large horizontally, the derived Green functions tend to severely underestimate the temperature, although they can correctly identify the locations of the hot spots as shown in [4] . In [6] , a Green function that is suitable for the rectangular-shaped chip geometry was presented, and look-up tables were established based on the Green function to assist the efficient evaluation of the temperature field. Nevertheless, the cost of this algorithm can become prohibitive for cell level full-chip thermal simulations where both the number of heat sources and the number of field regions are large.
In [7] , an efficient thermal simulation algorithm based on the solution of the finite difference equations using the multigrid approach was proposed, and it has the capability of performing the full-chip thermal analysis. In this paper, we present another highly efficient and accurate full-chip thermal simulation algorithm that is based on the Green function method, the discrete cosine transform (DCT), and the frequency domain computations. Since the temperature field can be obtained by convolving the power distribution with the underlying Green function, using the frequency domain computations in conjunction with the DCT will result in a significant improvement in efficiency, as can be achieved in many signal processing works where the time or space domain convolution is replaced by the frequency domain analysis. In the substrate parasitic extraction work [8] , the frequency domain analysis was performed under the framework of functional eigen-decomposition. Our algorithm takes a piece-wise constant power density map as the input and generates a piece-wise constant temperature map as the output. The primary steps of the algorithm include 1) Obtaining the frequency domain representation of the power density map using the 2D DCT.
This work was supported in part by DARPA under grant N66001-04-1-8909 and NSF under award CCR-0205227.
The authors thank the University of Minnesota Supercomputing Institute for providing the computing facilities.
2) Calculating the frequency domain representation of the temperature map by multiplying each frequency component of the power density map by the corresponding frequency response of the linear system determined by the Green function. 3) Using a 2D inverse discrete cosine transform (IDCT) to obtain the temperature map from its frequency domain representation. Both the 2D DCT and the 2D IDCT can be calculated efficiently using the 2D fast Fourier transform (FFT) in O(Ngc×log(Ngc)) time, where Ngc is the total number of grid cells in the power density map, which is also the total number of grid cells in the resulting temperature map. This is a significant improvement over the O(N 2 gc ) time complexity of the algorithm presented in [6] . Experimental results show that the algorithm proposed in this paper can achieve orders of magnitude speedup over the algorithm in [6] , while still maintaining the same accuracy.
Heat spreader Heat sink Fig. 1(a) shows an integrated circuit chip with the associated packaging, and Fig. 1(b) shows a schematic of the structure in Fig. 1(a) where the packaging including the heat spreader and the heat sink has been simplified, but the multilayered structure of the chip is explicitly shown. The steady state temperature distribution inside the chip are governed by Poisson's equation
where r = (x, y, z), T (r) is the temperature (°C) distribution inside the chip, g(r) is the volume power density (W/m 3 ), and k l(r) is the thermal conductivity (W/(m·°C)) of the layer where point r is located [9] . The vertical surfaces and the top surface of the chip are assumed to be adiabatic [10] , and the bottom surface of the chip is assumed to be convective, with an effective heat transfer coefficient h (W/(m 2 ·°C)) [11] . In mathematical form, these boundary conditions can be expressed as
where Ta is the ambient temperature, and kN is the thermal conductivity of the bottom layer of the chip. In addition, we enforce the continuity conditions at the interface between adjacent layers within the multilayered chip, i.e.,
where is an infinitesimally small quantity and ki is the thermal conductivity of the i th material layer in the multilayered chip structure.
III. FULL-CHIP THERMAL SIMULATION ALGORITHM A. The Green function of the rectangular-shaped multilayered structure
Let G(r, r ), with r = (x, y, z) and r = (x , y , z ), be the distribution of temperature above Ta in the multilayer when a unit point power source of 1W is placed at position r . Then G(r, r ) satisfies the equation
and the boundary conditions
is the three-dimensional Dirac delta function, and G(r, r ) is the Green function. The temperature field under an arbitrary power density distribution can be obtained easily as
Using a derivation similar to that presented in [12] , the Green function can be written in the form
where Z mn (z, z ) s are functions of only the z coordinates of the source and field points.
B. Full-chip thermal simulation algorithm
In the following analysis, we assume that both the heat sources and the field regions are located on discrete horizontal planes. Since the vertical dimensions of the devices are much smaller than that of the silicon chip, this assumption is reasonable for most practical purposes. For a particular pair of source and field planes, i.e., for a particular z and z , the Green function can be written as
The temperature distribution on the field plane due to the heat sources on the source plane is given by
where P d (x , y ) is the power density distribution on the source plane. The convolution integral in (16) can be considered as the governing equation of a linear system determined by the Green function G(x, y, x , y ).
As stated previously, the first step of our algorithm is to obtain the frequency domain representation of the power density map in the form
where
It is easy to show that φij(x, y) satisfies the equation
where λij = 8 > < > :
is the response of the linear system to the frequency component φij (x, y) [8] . After the frequency domain representation of the power density distribution in the source plane is obtained, the temperature distribution in the field plane can be calculated easily by
As will be shown next, both the frequency decomposition in (17) and the double-summation in (21) can be calculated efficiently using the DCT and IDCT through the FFT. Now we assume that both the source plane and the field plane are divided into M ×N rectangular grid cells of equal size as shown in Fig. 2 , and the power density in each grid cell on the source plane is uniform, i.e., the power density distribution can be written in the piece-wise constant form
where . Pmn is the power density of the mn th grid cell. Substituting (22) into (17) and using the orthogonality property of the cosine functions in the integral sense, we obtain
Note that to accurately represent the power density distribution P d (x , y ) using (17), the theoretical upper limit of the double summation should be infinity. In practical implementations, however, the summation must be truncated to ensure a reasonable runtime. Since (17) is essentially the Fourier expansion of P d (x , y ), a natural criterion for determining the truncation point is that enough "energy" contained in P d (x , y ) is covered by the truncated Fourier expansion. Mathematically, we have 
Substituting (22) into the left hand side of (26), we obtain
which can be considered as a form of the Parseval's theorem. The truncation points M and N are then determined by
where η is the proportion of the "energy" of the space domain signal P d (x , y ) that must be covered by the truncated Fourier expansion. In practice, we found that setting η to 90% will usually be enough to obtain very accurate results in temperature calculations. Note that for 0≤i<M and 0≤j<N, the double summation in (24) can be considered as a term in the 2D type-II DCT [13] of the power density matrix P . For i≥M or j≥N , we can always find integers s1 and s2 such that i = 2s1M ±î and j = 2s2N ±ĵ where 0≤î<M and 0≤ĵ<N. Hence, for any i and j, we always have
with 0≤î<M and 0≤ĵ<N is the 2D type-II DCT of the P matrix and the sign of (30) is determined by whether s1 and s2 are even or odd numbers [8] . Equation (31) can be calculated efficiently using the 2D FFT in O((M ·N )×log(M ·N )) time. After the 2D DCT matrix e P is obtained, the calculation of aij simply involves computing the coefficient Aij and finding the corresponding term e Pîĵ . From (18), (21), and (29), the temperature distribution T (x, y) can now be written as
and the average temperature of the mn th grid cell can be obtained by
As stated previously, any i≥M and j≥N can be written as i = 2s1M ±î and j = 2s2N ±ĵ such that 0≤î<M , 0≤ĵ<N and s1,s2 are integers. Using the periodicity of the cosine function, we can finally cast Tmn into the form
where Lîĵ = Kîĵ e Pîĵ and Kîĵ can be calculated as follows
2) ifî =0,ĵ = 0
3) ifî = 0,ĵ =0
After the coefficients K îĵ s are calculated, the matrix L can be easily obtained by point-wise multiplying the matrices K and e P . The double summation in (35) can then be calculated efficiently using the 2D IDCT.
Input:
• Chip geometry and physical properties of the material layers.
• Power density map -matrix P . Output: Temperature distribution map -matrix T .
Algorithm: 1) Calculate the Green function coefficients Cij s;
2) Calculate the frequency responses of the system λij s; 3) Calculate the type-II 2D DCT of the power density matrix e P = 2DDCT(P );
Update ASE; end while; 6) Calculate the matrix K; 7) Calculate the matrix L with Lîĵ = Kîĵ e Pîĵ ; 8) Calculate the temperature distribution map using the type-II 2D IDCT T = Ta + 2DIDCT(L); Fig. 3 . Thermal simulation algorithm using the Green function method, the DCT, and the frequency domain computations.
The complete thermal simulation algorithm using the Green function method, the DCT, and the frequency domain computations is shown in Fig. 3 . The asymptotic time complexity of the algorithm is O(Ngc × log(Ngc)) where Ngc = M ·N is the total number of grid cells. This is a significant improvement over the O(N 2 gc ) complexity of the algorithm given in [6] . Note that up to now, we have focused on the effect of one source layer on the temperature distribution of the field layer. When multiple source layers are present, such as in the emerging 3D IC technology, their effects can be calculated individually and summed up. The ambient temperature Ta should only appear once in the final summation. Because the method presented in [6] is accurate except for the very small truncation error of the Green function, we use the result in [6] as the benchmark to test the accuracy and efficiency of our algorithm. Fig. 4 shows the power distribution of an example chip and the calculated temperature map using the algorithm proposed in this paper. The heat sources are assumed to be located on the top surface of the chip and the size of the temperature map is 64×64. The chip has dimensions of 3.3mm×3.3mm×0.5mm. The thermal conductivity of silicon is 148W/(m·°C), and the bottom surface of the chip has an effective heat transfer coefficient of 8700W/(m 2 ·°C). We require our algorithm to achieve a similar accuracy as that in [6] by choosing η in (29) appropriately, i.e., within 1% error compared with the results from commercial computational fluid dynamic softwares, and the runtimes of the two algorithms are compared. Each runtime is divided into two parts, i.e., the time spent on the steps that are independent of the input power density matrix and hence can be pre-calculated outside the optimization loop in thermal-aware designs, and the time spent on the steps that depend on the input power density matrix and hence must be executed within the optimization loop. For both algorithms, the Green function coefficients Cij s can always be pre-calculated and stored. In addition, the look-up tables in the algorithm in [6] can be pre-calculated while the frequency responses λij s in our algorithm can be pre-calculated. For the precalculated steps, the runtime is dominated by the computation of the coefficients Cij s in both algorithms, which may take about 95sec to obtain a 2048×2048 C matrix. However, since these steps only need to be executed once for each die geometry and then used many times in later thermal simulations, the amortized cost is usually extremely small and we will ignore this part of the runtime in further analysis. Experimental results show that for the steps that are not pre-calculated, the runtime of our algorithm using MATLAB is 0.09sec while that of the algorithm in [6] is 0.53sec. Note that the runtime of the algorithm in [6] is linear with respect to the number of heat sources and there are only 14 heat sources in the example given here. For cell level full-chip simulations where the number of heat sources is significantly larger, the advantages of our algorithm will become even more obvious. In this subsection, we show an example of cell level full-chip thermal simulations. We consider a chip with dimensions of 1cm×1cm×0.5mm and the same physical properties as the chip used in the previous example. There are 1024×1024 square grid cells of equal size located on the top surface of the chip and a 1024×1024 temperature distribution map of the cell layer is calculated. Fig. 5 shows the input power density map and the resulting temperature map. The time it takes for MATLAB to obtain this temperature map containing 1.05M grid cells is only 6.4sec, excluding the time for the pre-calculations, while the runtime of the algorithm in [6] becomes intractable.
IV. EXPERIMENTAL RESULTS

A. Accuracy and Efficiency of the Algorithm
B. Cell Level Full-Chip Thermal Simulation
V. DISCUSSIONS -STRATEGIES FOR PERFORMING THE THERMAL SIMULATIONS WITH LOCAL HIGH ACCURACY REQUIREMENTS
In real design environments, the accuracy requirements on the thermal simulation frequently differ from place to place on the same chip. For example, in mixed signal designs where the analog circuits are fabricated on the same chip as the digital circuits, the analog blocks often have more stringent accuracy requirements on the thermal simulation because the operations of the analog circuits are more sensitive to temperature. For these kinds of problems, a better strategy can be adopted to accelerate the runtime of the algorithm further. The key idea is to use a coarse grid to divide the source and field layers such that each grid cell can contain several logic gates or analog functional units. The power density of each grid cell is calculated by summing up the power dissipations of all the logic gates and analog functional units located in it and divide the sum by the area of the grid cell. A coarse temperature map is then obtained from the coarse power density map using the algorithm presented in section III and is used for the digital blocks on the chip. Note that we must ensure that the coarse grid is fine enough for the digital blocks but they may not achieve the accuracy requirements of the analog blocks. 6 shows a chip that is divided into M ×N coarse grid cells each of which contains several logic gates or analog functional units, and let the shaded area represent the analog block. An M ×N temperature map is first obtained. The inaccuracies in the temperature calculations, besides that due to the truncation of the frequency domain representation of the power density map, will come from two sources which include
• Assuming that the power density in each grid cell is uniform.
• Only the average temperature of each grid cell is calculated, i.e., all of the logic gates and analog functional units inside the same grid cell obtain the same calculated temperature. Now assume that we need to calculate the temperature of the analog functional unit located in the ij th grid cell and represented by the black rectangle more accurately. Let Tij and T ij,kl be the average temperature of the ij th grid cell and the contribution of the average power density of the kl th grid cell to the average temperature of the ij th grid cell in the coarse grid temperature calculations respectively, and let Ta.c. represent the more accurate average temperature of the analog functional unit. We divide the grid cells into two categories, i.e., those with close interactions with the ij th grid cell (denoted by CIij) and those without close interactions with the ij th grid cell. The effects of the logic gates and analog functional units that are contained in the grid cells belonging to CIij are re-calculated for increased accuracy. For example, we can put the ij th grid cell and all the grid cells surrounding it into the first category and all the other grid cells into the second category. If higher accuracy is required, then more grid cells should be put into the first category. The temperature Ta.c. can then be calculated using 
VI. CONCLUSIONS
In this paper, we presented a cell level full-chip thermal simulation algorithm that is a combination of the Green function method, the DCT, and the frequency domain computations. Experimental results show that our algorithm can achieve orders of magnitude speedup compared with previous Green function based thermal simulation algorithms while still maintain the same accuracy. The simulation of a chip containing 1.05M grid cells only takes about 6.4sec after the pre-calculations have been performed. In addition, the strategies that can be used for the problems that have local high accuracy requirements on temperature calculations are also discussed.
