Abstract-Ever-increasing integrated circuit (IC) power densities and peak temperatures threaten reliability, performance, and economical cooling. To address these challenges, thermal analysis must be embedded within IC synthesis. However, this requires accurate three-dimensional chip-package heat flow analysis. This has typically been based on numerical methods that are too computationally intensive for numerous repeated applications during synthesis or design. Thermal analysis techniques must be both accurate and fast for use in IC synthesis. This paper presents a novel accurate incremental spatially and temporally adaptive chip-package thermal analysis technique called ISAC for use in IC synthesis and design. It is common for IC temperature variation to strongly depend on position and time. ISAC dynamically adapts spatial-and temporal-modeling granularities to achieve high efficiency while maintaining accuracy. Both steady-state and dynamic thermal analyses are accelerated by the proposed heterogeneous spatial-resolution adaptation and asynchronous thermal-element time-marching techniques. Each technique enables orders-of-magnitude improvement in performance while preserving accuracy when compared with other state-of-the-art adaptive steady-state and dynamic IC thermal analysis techniques. Experimental results indicate that these improvements are sufficient to make accurate dynamic and steady-state thermal analysis practical within the inner loops of IC synthesis algorithms. ISAC has been validated against reliable commercial thermal analysis tools using industrial and academic synthesis test cases and chip designs. It has been implemented as a software package suitable for integration in IC synthesis and design flows and has been publicly released.
ISAC: Integrated Space-and-Time-Adaptive
Chip-Package Thermal Analysis I. INTRODUCTION W ITH increasing integrated circuit (IC) power densities and performance requirements, thermal issues have become critical in IC design [1] . If not properly addressed, increased IC temperature affects other design metrics including performance (via decreased transistor-switching speed resulting from decreased charge carrier mobility and increased interconnect latency), power and energy consumption (via inManuscript received February 3, 2006 . This work was supported in part by the Natural Sciences and Engineering Research Council under Discovery Grant 388694-01 and in part by the National Science Foundation under Award CNS-0347941. This paper was recommended by Associate Editor V. Narayanan.
Y. Yang, C. Zhu, and L. Shang are with the Department of Electrical and Computer Engineering, Queen's University, Kingston, ON K7L 3N6, Canada (e-mail: 4yy6@qlink.queensu.ca; 4cz1@qlink.queensu.ca; li.shang@ queensu.ca).
Z. Gu and R. P. Dick are with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208 USA (e-mail: zgu646@eecs.northwestern.edu; dickrp@eecs.northwestern.edu).
Digital Object Identifier 10.1109/TCAD.2006.882589 creased leakage power), reliability (via electromigration, thermal cycling, time-dependent dielectric breakdown, etc.), and price (via increased system cooling cost). It is thus critical to consider thermal issues during IC design and synthesis. When determining the impact of each decision in the synthesis or design process, the impacts of changed thermal profile on performance, power, price, and reliability must be considered. This requires repeated use of detailed chip-package thermal analysis. This analysis is generally based on computationally expensive numerical methods. In order to support IC synthesis, a thermal simulator must be capable of accurately analyzing models that contain tens of thousands of discrete elements. Moreover, the solver must be fast enough to support numerous evaluations in the inner loop of a synthesis flow. Reliance on nonadaptive matrix operations that increase in space and time complexity superlinearly with matrix size (and model element count) has made achieving both accuracy and speed elusive.
The IC thermal analysis problem may be separated into two subproblems: steady-state (or static) analysis and dynamic analysis. Steady-state analysis determines the temperature profile to which an IC converges as time approaches infinity, given power and thermal-conductivity profiles. Steady-state analysis is sufficient when an IC thermal profile converges well before subsequent changes to its power profile and when transient thermal profiles, which might indicate short-term temperature peaks, may be neglected. Dynamic thermal analysis determines the temperature profile of an IC at any time, given initial temperature, power, heat capacity, and thermal-conductivity profiles. Although more computationally intensive than steadystate thermal analysis, dynamic thermal analysis is necessary when an IC power profile varies before its thermal profile has converged or when transient features of the thermal profile are significant.
Thermal analysis has a long history. Traditionally, thermal issues were solely addressed during cooling and packaging design based on worst case analysis; in the past, thermal issues were typically ignored during IC design or transferred as power constraints, e.g., a predefined peak power budget. A number of industrial tools were developed and widely used by packaging designers, such as FLOMERICS [2] , ANSYS [3] , and COMSOL (formerly known as FEMLAB) [4] . Since thermal analysis was conducted only a few times during the design process, efficiency was not a major concern. Typically, it took minutes or hours to conduct each simulation. Due to increasing power density and cooling costs, such worst case based cooling design has become increasingly difficult, if not infeasible. Researchers started addressing thermal issues during 0278-0070/$25.00 © 2007 IEEE IC design, for which both the efficiency and accuracy of thermal analysis are critical. Recently, Skadron et al. developed a steady-state and dynamic thermal analysis tool called HotSpot for microarchitectural evaluation [5] . Therefore, only coarsegrained modeling is supported. In this tool, neither the matrix techniques of the steady-state analysis tool nor the lock-step fourth-order Runge-Kutta time-marching technique used for dynamic analysis makes use of spatial or asynchronous temporal adaptation; therefore, accuracy or performance suffers. Li et al. proposed a full-chip steady-state thermal analysis method [6] . In this paper, matrix operations are handled using the multigrid method, which can efficiently support fine modeling granularity with a large number of grid elements. However, although the advantages of heterogeneous element discretization are noted, in this paper, no systematic adaptation method is provided. Smy et al. proposed a quad-tree meshrefinement technique for thermal analysis [7] but did not consider local temporal adaptation. Zhan and Sapatnekar [8] proposed a steady-state thermal analysis method based on Green's function formalism that was accelerated by using discrete cosine transforms and a lookup table. However, these methods [6] - [8] do not support dynamic thermal analysis. Liu et al. proposed a moment-matching-based thermal analysis method that is suitable for accelerating thermal analysis of course-grained architectural models [9] . Numerical-analysis techniques were also proposed to characterize the thermal profile of on-chip interconnect layers [10] - [12] .
Existing IC thermal analysis tools are capable of providing either accuracy or speed but not both. Accurate thermal analysis requires expensive computation for many elements in some regions at some times. Non-adaptive IC thermal analysis techniques ensure accuracy by choosing uniformly fine levels of detail across time and space, i.e., they use equivalent physical sizes or time-step durations for all thermal elements. The large number of elements and time steps resulting from such techniques makes them computationally intensive and therefore impractical for use within IC synthesis. This paper presents validated synthesis-oriented IC thermal analysis techniques that differ from existing work by doing operation-byoperation dynamic adaptation of temporal and spatial resolution in order to dramatically reduce computational overhead without sacrificing accuracy. Experimental results indicate that the proposed spatial-adaptation technique improves CPU time by 21.64-690.00× and that the temporal-adaptation technique improves CPU time by 122.81-337.23×. Although much faster than conventional analysis techniques, the proposed techniques have been designed for accuracy even when this increases complexity and run time, e.g., by correctly modeling the dependence of thermal conductivity on temperature. These algorithms have been validated against COMSOL Multiphysics [4] , a reliable commercial finite-element physical-process modeling package and a high-resolution numerical spatially and temporally homogeneous simultaneous differential equation solver. Experimental results indicate that using existing thermal analysis techniques within the IC synthesis flow would increase CPU time by many orders of magnitude, making it impractical to synthesize complex ICs. The proposed techniques make accurate dynamic and static thermal analysis practical within the inner loops of IC synthesis algorithms. They have been implemented as a software tool called integrated space-andtime-adaptive chip-package thermal analysis technique (ISAC), which has been publicly released [13] .
This paper is organized as follows: Section II gives a motivating example that illustrates the need for fast and accurate thermal analysis during IC synthesis and suggests techniques to reach this goal. Section III describes the model, algorithms, and implementation of ISAC, the proposed steady-state and dynamic thermal analysis tool. Section IV presents experimental results that validate ISAC and demonstrate the dramatic performance advantages that result from spatial and temporal adaptations during thermal analysis. Section V presents the conclusions.
II. MOTIVATING EXAMPLES
In this section, we use a thermal-aware IC synthesis flow to demonstrate the challenges of fast and accurate IC thermal analysis. Fig. 1 shows an integrated behavioral-level and physical-level IC synthesis system [14] . This synthesis system uses a simulated annealing algorithm to jointly optimize several design metrics, including performance, area, power consumption, and peak IC temperature. It conducts both behavioral-level and physical-level stochastic optimization moves, including scheduling, voltage assignment, resource binding, floor planning, etc. An intermediate solution is generated after each optimization move. A detailed two-dimensional active-layer power profile is then reported based on the physical floorplan. Thermal analysis algorithms are invoked to guide optimization moves.
As illustrated by the synthesis-flow example, for each intermediate solution, detailed thermal characterization requires full chip-package thermal modeling and analysis using computationally intensive numerical methods. Figs. 2 and 3 show a full chip-package thermal-modeling example from an IBM IC design (see Section IV-A for more detail). The steady-state thermal profile of the active layer of the silicon die and the top layer of the cooling package, as shown in Fig. 3 , were characterized using a multigrid thermal solver by partitioning the chip and the cooling package into 131 072 homogeneous thermal elements. Without spatial and temporal adaptations, the solver requires many seconds or minutes when run on a highperformance workstation. Compared to steady-state thermal modeling, characterizing an IC dynamic thermal profile is even more time consuming. IC synthesis requires a large number of optimization steps; thermal modeling can easily become its performance bottleneck.
A key challenge in thermal-aware IC synthesis is the development of fast and accurate thermal analysis techniques. Fundamentally, IC thermal modeling is the simulation of heat transfer from heat producers (transistors and interconnects) through a silicon die and a cooling package to the ambient environment. This process is modeled with partial differential equations. In order to approximate the solutions of these equations using numerical methods, finite discretization is used, i.e., an IC model is decomposed into numerous three-dimensional (3-D) elements. Adjacent elements interact via heat diffusion. Each element is sufficiently small to permit its temperature to be expressed as a difference equation that is a function of time, its material characteristics, its power dissipation, and the temperatures of its neighboring elements.
In an approach analogous to electric circuit analysis, thermal RC (or R) networks are constructed to perform dynamic (or steady-state) thermal analysis. Direct matrix operations, e.g., inversion, may be used for steady-state thermal analysis. However, the computational demand of this technique hinders its use within synthesis. Dynamic thermal analysis may be conducted by partitioning the simulation period into small time steps. The local times of all elements are then advanced in lock-step using transient temperature approximations yielded by difference equations. The computational complexity of dynamic thermal analysis is a function of the number of grid elements and time steps. Therefore, to improve the efficiency of thermal modeling, the key issue is to optimize the spatial-and temporal-modeling granularity, eliminating nonessential elements and stages.
There is a tension between accuracy and efficiency when choosing modeling granularity. Increasing modeling granularity reduces analysis complexity but may also decrease accuracy. Uniform temperature is assumed within each thermal element; intraelement thermal gradients are neglected. Therefore, increasing spatial-modeling granularity naturally increases modeling errors. Similarly, increasing time-step duration may result in failure to capture transient thermal fluctuation or may increase truncation error when the actual temperature functions of some elements are of higher order than the difference equations used to approximate them.
IC thermal profiles contain significant spatial and temporal variations due to the heterogeneity of thermal conductivity and heat capacity in different materials, and varying power profiles that result from nonuniform functional unit activities, placements, and schedules. Fig. 4 shows the interelement thermal-temperature-difference distribution using homogeneous meshing of the example shown in Fig. 3 . The temperature differences between all pairs of adjacent thermal elements are considered. These values are normalized to the smallest value encountered for this example. Fig. 4 contains a wide distribution of temperature differences. Heterogeneous spatial-element-discretization refinement based on temperature differences has the potential to improve performance without impacting accuracy.
For dynamic thermal simulation, the size of each thermal element's time steps should permit accurate approximation by the element difference equations. An IC may experience different thermal fluctuations at different locations. Therefore, the best time-step durations for elements at different locations may vary. Fig. 5 shows the maximum potential time-step duration of each individual block based on local thermal variation. These values are normalized to the smallest maximum potential timestep duration of any block. Local adaptation of time-step sizes has the potential to improve performance without degrading accuracy. 
III. THERMAL ANALYSIS MODEL AND ALGORITHMS
This section gives details on the proposed thermal analysis techniques. Section III-A defines the steady-state and dynamic IC thermal analysis problems. Section III-B gives an overview of the algorithms used by ISAC, the proposed thermal analysis tool. Section III-C gives an overview of multigrid analysis and describes the spatial-adaptation techniques used by ISAC to accelerate analysis. Section III-D gives an overview of time marching and describes the temporal-adaptation techniques used by ISAC. Section III-E points out the accuracy benefits of considering the dependence of thermal conductivity on temperature within a thermal model. Finally, Section III-F explains the interface between ISAC and an IC synthesis algorithm.
A. IC Thermal Analysis Problem Definition
IC thermal analysis is the simulation of heat transfer through heterogeneous material among heat producers (e.g., transistors) and heat consumers (e.g., heat sinks attached to IC packages). Modeling thermal conduction is analogous to modeling electrical conduction, with thermal conductivity corresponding to electrical conductivity, power dissipation corresponding to electrical current, heat capacity corresponding to electrical capacitance, and temperature corresponding to voltage.
The equation governing heat diffusion via thermal conduction in an IC is given as follows:
which is subject to the boundary condition
In (1), ρ is the material density; c is the mass heat capacity; T ( r, t) and k( r) are the temperature and thermal conductivity of the material at position r and time t; and p( r, t) is the power density of the heat source. In (2), n i is the outward direction normal to the boundary surface i, h i is the heat-transfer coefficient; and f i is an arbitrary function at surface i. Note that in reality, the thermal conductivity k also depends on temperature (see Section III-E). ISAC supports arbitrary heterogeneous 3-D thermal conduction models. For example, a model may be composed of a heat sink in a forced-air ambient environment, heat spreader, bulk silicon, active layer, and packaging material or any other geometry and combination of materials.
In order to do numerical thermal analysis, a seven-point finite-difference discretization method can be applied to the left and right sides of (1), i.e., the IC thermal behavior may be modeled by decomposing it into numerous rectangular parallelepipeds, which may be of nonuniform sizes and shapes. Adjacent elements interact via heat diffusion. Each element has power dissipation, temperature, and thermal capacitance as well as thermal resistance to adjacent elements. The discretized equation at an interior point of a grid element is given as follows:
where i, j, and l are discrete offsets along the x, y, and z axes. Given that ∆x, ∆y, and ∆z are discretization steps along the x, y, and z axes, V = ∆x∆y∆z. G x , G y , and G z are the thermal conductivities between adjacent elements. They are defined as follows: G x = k∆y∆z/∆x, G y = k∆x∆z/∆y, and G z = k∆x∆y/∆z. ∆t is the discretization step in time t.
For an IC chip-package design with N discretized elements, the thermal analysis problem can be described as follows:
where the thermal-capacitance matrix C is an [N × N ] diagonal matrix, the thermal-conductivity matrix A is an [N × N ] sparse matrix, T (t) and P (t) are [N × 1] temperature and power vectors, and u(t) is the time-step function. For steadystate analysis, the left term in (4) expressing temperature variation as function of time t is dropped. For either the dynamic or steady-state version of the problem, although direct solutions are theoretically possible, the computational expense is too high for use on high-resolution thermal models.
B. ISAC Overview
Fig . 6 gives an overview of ISAC, our proposed incremental space-and-time-adaptive chip-package thermal analysis tool. When used for steady-state thermal analysis, it takes a 3-D chip and package thermal-conductivity profile, and a power dissipation profile as input. A multigrid incremental solver is used to progressively refine thermal-element discretization to rapidly produce a temperature profile.
When used for dynamic thermal analysis, in addition to the input data required for steady-state analysis, ISAC requires the chip-package heat-capacity profile. In addition, it may accept an initial temperature profile and an efficient thermal-element discretization. If these inputs are not provided, the dynamicanalysis technique uses steady-state analysis to produce its initial temperature profile and element discretization. It then repeatedly updates the local temperatures and times of elements at asynchronous time steps, appropriately adapting the step sizes of neighbors to maintain accuracy.
As described in Section III-E, after analysis is finished, the temperature profile may be adapted using a feedback loop in which thermal conductivity is modified based on the temperature in order to account for nonlinearities induced by the dependence of thermal conductivity or leakage power consumption on temperature. Upon convergence, the temperature profile is reported to the IC synthesis tool or designer.
C. Spatial Adaptation in Thermal Analysis
During thermal analysis, both time complexity and memory usage are linearly or superlinearly related to the number of thermal elements. Therefore, it is critical to limit the discretization granularity. As shown in Fig. 4 , IC thermal profiles may contain significant spatial variation due to the heterogeneity of thermal conductivity and heat capacity in different materials, and the variation of power profile.
In this section, we present an efficient technique for adapting thermal-element spatial resolution during thermal analysis. This technique uses incremental refinement to generate a tree of heterogeneous rectangular parallelepipeds that supports fast thermal analysis without loss of accuracy. Within ISAC, this technique is incorporated with a multigrid numerical-analysis method, yielding a comprehensive steady-state thermal analysis solution. Dynamic thermal analysis also benefits from the proposed spatial-adaptation technique due to the dramatic reduction of the number of grid elements that must be considered during time-marching simulation.
1) Hybrid Data Structure: Efficient spatial adaptation in thermal analysis relies on sophisticated data structures, i.e., it requires the efficient organization of large data sets and representation of multilevel modeling resolutions. In addition, efficient algorithms for interlevel transition are necessary for adaptive thermal modeling and numerical analysis. In ISAC, the proposed spatial-adaptation technique is supported by a hybrid octree data structure, which provides an efficient and flexible representation to enable spatial-resolution adaptation. A hybrid octree is a tree that maintains spatial relationships among rectangular parallelepipeds in three dimensions. In a hybrid octree, each node may have two, four, or eight immediate children. Fig. 7 shows an example of a hybrid octree representation. As shown, in the hybrid octree, different modeling resolutions are organized into contours along the tree hierarchy. In this example, nodes (elements) 1-8 form a level-1 contour; nodes (elements) 1, 2, 4-7, 9-14 form a level-2 contour; leaf nodes (elements) 1, 2, 4-7, 10-16, which are shown as shaded blocks, form a level-3 contour. Heterogeneous spatial resolution may result in a thermal element that resides in multiple resolution levels, e.g., element 2 resides in levels 1, 2, and 3. This information is represented as nodes existing in multilevel contours in the tree.
The hybrid tree structure also enables a compact representation of the thermal-conductivity matrix. Within the thermalconductivity matrix, each nonzero item g i,j represents the thermal conductivity between two adjacent thermal elements i and j. Since the hybrid tree structure contains complete connectivity information of the thermal elements within each contour level, it enables efficient matrix indexing and minimizes both the memory use and the computational time required by matrix operations.
The intergrid thermal conductivity between two adjacent thermal elements i and j is determined as follows:
where k i and k j are the thermal conductivities of elements i and j, respectively. A i is the cross-sectional area of grid element i along the plane parallel to the face contacting grid element j. The converse is true of A j . t i and t j are the distances from the center of each grid element to its contact surface. Spatial-resolution adaptation requires two basic operations: partitioning and coarsening. In a hybrid octree, partitioning is the process of breaking a leaf node along arbitrary orthogonal axes, e.g., nodes 13 and 14 result from partitioning node 8. Coarsening is the process of merging direct subnodes into their parent, e.g., nodes 9-12 merged into node 3.
To conduct thermal analysis across different discretization resolutions, we propose an efficient contour-search algorithm with computational complexity O(N ) that determines thermal grid elements belonging to a particular discretization-resolution level. As shown in Algorithm 1, leaf nodes are assigned to the finest resolution level (lines 1-3). The resolution level of a parent node of a subtree equals the minimal resolution level of all of its intermediate children nodes level min minus one (lines 4-7 and 13). An element may reside in multiple resolution levels (lines [8] [9] [10] [11] [12] 2) Multigrid Method: For steady-state thermal analysis, the heat-diffusion problem is (approximately) described by the following linear equation: AT = P . The size of thermalconductivity matrix A increases quadratically with the number of discretization grid elements. Therefore, directly solving this equation is intractable. Iterative numerical methods are thus widely used. The quality of iterative methods is typically characterized by their convergence rates. Convergence rate is a function of the error field frequency [15] . Standard iterative methods, such as those of Jacobi and Gauss-Siedel, have slow convergence rates due to inefficiency in removing lowfrequency errors. This problem becomes more prominent with fine-grain discretization.
In this paper, we describe a multigrid iterative relaxation solver for steady-state thermal analysis. Multigrid methods are among the most efficient techniques for solving large-scale linear algebraic problems that arise from the discretization of partial differential equations [15] , [16] . In conjunction with linear solvers, the multigrid method provides an efficient multilevel relaxation scheme. Using this technique, low-frequency errors, which limit the performance of standard iterative methods, are transferred into the high-frequency domain through grid coarsening. Algorithm 2 shows the proposed multigrid method. This method consists of a set of relaxation stages across the discretization hierarchy, where each stage is responsible for eliminating a particular frequency bandwidth of errors. Given a thermal-conductivity matrix A and power profile P , a multigrid cycle starts from the finest granularity level (line 1), at which iterative relaxation is conducted using a linear solver to remove high-frequency errors until low-frequency errors dominate. Next, the solution at the finest granularity level is transformed to a coarser level, in which the original lowfrequency errors from the finest granularity level manifest themselves as high-frequency errors. This restriction procedure is applied recursively (line 9) until the coarsest level is reached (line 6). Then, a reverse procedure called prolongation is used to interpolate coarser corrections back to a finer level recursively across the grid discretization hierarchy (line 11). The final result is the estimated steady-state IC chip-package thermal profile.
Algorithm 2 Multigrid cycle
Require thermal conductance matrix A, power profile vector P Ensure AT = P 1: Presmoothing step: Iteratively relax initial random solution {HF error eliminated}. 2: subtask coarse-grid correction 3: Compute residue from the finer grid. 4: Approximate residue in the coarser grid. 5: Solve the coarser-grid problem using relaxation. 6: if the coarsest level has been reached 7:
Directly solve the problem at this level. 8: else 9:
Recursively apply the multigrid method. 10: end if 11: Map the correction back from the coarser to finer grid. 12: end subtask 13: Postsmoothing step: Add the correction to the solution at the finest grid level. 14: Iteratively relax to obtain the final solution.
3) Incremental Analysis: In this paper, the spatialdiscretization process is governed by temperature-difference constraints. Iterative refinement is conducted in a hierarchical fashion. Upon initialization, the steady-state thermal analysis tool generates a coarse homogeneous octree based on the chip size. Iterative temperature approximation is repeated until convergence to a stable profile. Elements across which temperature varies by more than thermal-difference constraints are further partitioned into subelements. Given that T i is the temperature of element i and that S is the temperature threshold, for each ordered element pair (i, j), the new number of elements Q along some partition g is given by
For each element i, partitions along three dimensions are gathered into a 3-tuple (x i , y i , z i ) that governs partitioning element i into a hybrid suboctree. The number of subelements depends on the ratio of the temperature gradient to the temperature-difference threshold. Therefore, some elements may be further partitioned, and local thermal simulation may be repeated. Simulation terminates when all element-to-element temperature differences are smaller than the predefined threshold S. This method focuses computation on the most critical regions, increasing analysis speed while preserving accuracy. The temperature-difference threshold S required to trigger further thermal-element partitioning is an input to ISAC. Therefore, high thresholds may be used during early design exploration in order to decrease thermal analysis time. For the results in this paper, we used a low threshold of 1 K.
D. Temporal Adaptation in Thermal Analysis
ISAC uses an adaptive time-marching technique for dynamic thermal analysis. A number of authors have written wonderful introductions to time-marching methods [17] , [18] . Time marching is a numerical method to solve simultaneous partial differential equations by iteratively advancing the local times of elements. The proposed technique is a time-marching finite-difference method [17] . The computational cost of such techniques is approximately e∈E u e c e , where E is the set of all elements, u e is the number of time steps for a given element, and c e is the time cost per evaluation for that element. The Runge-Kutta family of finite-difference methods is commonly used to solve discretized finite-difference problems, such as dynamic thermal analysis [17] . For Runge-Kutta methods, assuming a constant evaluation time and noting that all elements experience the same number of updates, run time can be expressed as uc e∈E n e , where n is the number of a block's transitive neighbors. For these methods, element time synchronization eliminates the need to repeatedly evaluate transitive neighbors, yielding a time cost of |E|uc.
Analysis time is classically reduced by attacking the number of time steps u by either using higher order methods that allow larger time steps under bounded error or adapting global step size during analysis, e.g., the adaptive Runge-Kutta methods. Higher order methods allow the actual temperature function to be accurately approximated for a longer span of time, reducing the number of steps necessary to reach the target time.
1) Two Popular Time-Marching Techniques:
For conventional synchronous methods, it is necessary to select a fixed time-step size that is small enough to satisfy an error bound, i.e.,
where h f is the fixed step size to use throughout the analysis, t is the time from the set of all explicitly visited times within the sample period U , e is an element from the set of all elements E, and S te is the maximum safe step size for element e at time t. Although the weaker assurance of accuracy at the sample period would be sufficient, in practice, this requires that accuracy be maintained throughout time marching due to the dependence of element temperatures on their predecessors.
Further improvement is possible via the use of a synchronous adaptive time-marching method. In such methods, the time step is adjusted such that the largest globally safe step is taken, i.e.,
where h s t is the step size to be used at time t. Globally adaptive synchronous time marching is used by existing popular dynamic thermal analysis packages, e.g., HotSpot [5] .
2) Asynchronous Element Time Marching: Although synchronous adaptive time marching has the potential to outperform nonadaptive techniques, much greater gains are possible. The requirement that all thermal elements be synchronized in time implies that at each time step, all elements must have their local times advanced by the smallest step required by any element in the model. As shown in Fig. 5 , this implies that most elements are forced to take unnecessarily small steps. If instead it were possible to allow the thermal elements to progress forward in time asynchronously, it would be possible to allow elements for which the temperature-approximation function accurately matches the actual temperature over a longer time span duration to choose larger steps. Thus
where h a te is the asynchronous adaptive step size to use for element e at time t. If, at many times i.e., the average step size is much greater than the adaptive synchronous step size, as is clearly the case for the dynamic IC thermal analysis problem (see Section II), then asynchronous element time marching clearly holds the potential to dramatically accelerate dynamic thermal analysis compared with nonadaptive and synchronous adaptive techniques. However, reaching this potential requires that a number of problems first be identified and solved: Asynchronous element time marching increases the cost of using higher order methods and increases the difficulty of maintaining numerical stability.
3) Impact of Asynchronous Elements on Order:
Recall that thermal-element temperature-approximation functions depend on the temperatures of an element's (transitive) neighbors at a consistent time. Determining these temperatures is trivial in conventional synchronous time-marching techniques: All elements have the same time. However, asynchronous time marching requires that consistency be achieved despite the differing thermal-element local times.
Although many time-marching numerical methods for solving ordinary differential equations are based on methods that do not require explicit differentiation, these methods are conceptually based on repeated Taylor series expansions around increasing time instants. Revisiting these roots and basing time marching on Taylor series expansion allows asynchronous element-by-element time-step adaptation by supporting the extrapolation of temperatures to arbitrary times.
For many problems, the differentiation required for calculating Taylor series expansions is extremely complicated.
Fortunately, for the dynamic IC thermal analysis problem, the problem is tractable. Noting the definitions in (3) and given that T i (t) is the temperature of element i at time t, G in is the thermal conductivity between thermal elements i and n, V i is the volume of thermal element i, N are the element's neighbors, and M is the neighbor depth, we know that the net heat flow for a given thermal element i is zero, i.e.,
This can be simplified by introducing a few variables. Let
and solve for T (t) given by
by 15 and 16 (17)
Let
The linearity theorem for γ can then be applied as follows:
Let s = 0 to yield A = 1/α, and let s = −α/κ to yield B = −κ/α. We then have
Note that although the impact of transitive neighbors is not explicitly stated, it may be considered in higher order methods. Thus, β should be redefined to explicitly consider transitive neighbors, i.e.,
Thus, the nearest neighbor approximation of the temperature of element i at time t + h is given as follows:
Boundary conditions are imposed by the chip, package, and cooling solution. Note that this derivation need not be carried out online during thermal analysis. It is done once for an update function and the resulting equation. It is possible to use an exact local update function, such as (26), or an approximation function based on low-order Taylor series expansion. In practice, we found that a first-order approximation was sufficient for local updates as long as the impact of transitive neighbors was considered via (27) and (28). Note that the potentially differing values of step size h and local time t for all thermal elements imply that the number of transitive temperature extrapolations necessary for an element to advance by one time step may not be amortized over multiple uses as in the case of the synchronous Runge-Kutta methods. We will contrast a conventional Runge-Kutta method with ISAC to illustrate the changes necessary for asynchronous element time marching. For the sake of explanation, consider the fourth-order Runge-Kutta method, which is used for the purpose of comparison in Section IV-B. Given that N i is the set of block i's neighbors; p i , T i , T i , and κ i are the power consumption, current temperature, next temperature, and heat capacity of element i, respectively; G in is the thermal conductivity between elements n and i; and h is the time-marching step size This amortization allows increases in the order of Runge-Kutta and explicit synchronous time-marching methods without great increases in computational complexity. However, asynchronous thermal analysis requires the extrapolation of the temperature of a thermal element to the numerous different local times of its neighbors. This prevents the amortization described previously. As a result, for 3-D thermal analysis using asynchronous time marching, the number of evaluations e is related to the transitive neighbor count d as follows:
In summary, although it is common to improve the performance of time-marching techniques by increasing their orders, thereby increasing their step sizes, for the IC thermal analysis problem, greater gains are possible by decoupling element local times, allowing most elements to take larger than minimumsized steps. However, this requires explicit differentiation and prevents the amortization of neighbor temperature extrapolation, increasing the cost of using higher order methods relative to that of using fully synchronous element time-marching techniques. As demonstrated in Section IV, this tradeoff is an excellent one: The third-order element-by-element adaptation method yields speed-ups in the range of 122.81-337.23× when compared to the fourth-order adaptive Runge-Kutta method.
4)
Step-Size Computation: We now describe the elementby-element step-size adaptation methods used by ISAC to improve performance while preserving accuracy. As illustrated in the right portion of Fig. 6 , dynamic analysis starts with an initial 3-D temperature profile and hybrid octree, which may have been provided by the synthesis tool or generated by ISAC using steady-state analysis, a chip/package/ambient heat capacity and thermal-conductivity profile, and a power profile. After determining the initial maximum safe step sizes of all elements, ISAC initializes an event queue of elements sorted by their target times, i.e., the element's current time plus its step size. The element with the earliest target time is selected; its temperature is updated; a new maximum safe step size is calculated for it; and it is reinserted in the event queue. The event queue serves to minimize the deviation between decoupled element current times, thereby avoiding temperature extrapolation beyond the limits of the local-time bounded-order expansions. The new step size must take into account the truncation error of the numerical method in use as well as the step sizes of the neighbors. Given that h i is element i's current step size, v is the order of the time-marching numerical method, u is a constant slightly less than one, y is the error threshold, F i is element i's limited-order temperature approximation function, and t i is i's current time, the safe next step size for a block, ignoring nonlocal effects, is given by i.e., the new step size is computed by determining the absolute difference between the result of taking two 3/4h steps and one 3/2h step and dividing the error threshold by this value. This is illustrated in Fig. 8 for a first-order method. The result is then taken to the power of the reciprocal of the order of the method. Note that (35) is the general expression. For the sake of example, the expression for a first-order method is given in the following: Given that dT i (t)/dt is the derivative of i's temperature with respect to time at time t
This method of computing a new step size is based on the literature [18] . However, it uses noninteger test step sizes to bracket the most probable new step size.
5) Neighborhood Time-Deviation Bounds for Numerical Stability:
It is necessary to further bound time-marching step sizes to ensure that the local times of neighbors are sufficiently close for accurate temperature extrapolation. For the sake of example, consider the situation illustrated in Fig. 9 . To the far left of an IC active layer, at position A, the power consumption of a thermal element has recently increased. As a consequence, the temperatures of elements increase in a wave that propagates rightward from position A. Note that the dashed line indicates the derivative of temperature with respect to time at time zero as a function of position and not the derivative of the temperature with respect to position.
The maximum asynchronous safe step size of an element to the far right of the IC, at position B, is large because the temperature function implied by the temperatures of other thermal elements in its neighborhood is easily approximated. For B, the rate of temperature change based on its thermal profile of its neighborhood is zero. Therefore, the element is marked with an update time far in the future. Unfortunately, the temperature change resulting from increase in A's power consumption may reach B's neighborhood before B's marked update time. Therefore, it is necessary to constrain the deviation of the local times of immediate neighbors in order to prevent instability due to unpredicted global effects. Given that N i is the set of i's neighbors and w is a small constant, e.g., three, the new step size is given as follows:
For efficiency, the h n of a neighbor at its own local time is used. This temporal-adaptation technique based on (27), (28), and (37) is general and has been tested with first-order, secondorder, and third-order numerical methods. As indicated in Section IV-B, the result is a 122.81-337.23× speedup without loss of accuracy when compared to the fourth-order adaptive Runge-Kutta method.
6) Comments on Adaptation in Simulation and
Numerical Analysis: Although the asynchronous method described in this paper is new, researchers have previously considered using asynchronous agents or elements in numerical simulation. Kozyakin provides a survey of and a tutorial on asynchronous systems, focusing on distributed computational networks [19] . Esposito and Kumar describe event detection and synchronization methods to allow mostly asynchronous simulation of multiagent systems, e.g., multibody systems [20] . The work of Devgan and Roher on adaptively controlled explicit simulation is the most closely related to the proposed technique [21] . They propose using piecewise linear functions to model the responses of circuit elements. Instead of moving forward in time at a constant pace, each time step moves to the nearest time at which any circuit element becomes quiescent. In contrast, ISAC directly supports smooth functions, such as those appearing in the heat-transfer problem, and allows step sizes to be adaptively controlled by error bounds instead of imposed by the structure of the piecewise linear models used in the problem specifications.
E. Impact of Variable Thermal Conductivity
The thermal conductivity of a material, e.g., silicon, is the ratio of its heat flux density to temperature gradient and is a function of temperature T . An IC's thermal conductivity k( r, T ) is also a function of position r. Most previous fast IC thermal analysis work ignores the dependence of thermal conductivity on temperature, approximating it with a constant. This introduces inaccuracy in analysis results. In contrast, ISAC models thermal conductivity as a function of temperature. Position-and temperature-dependent thermal conductivity is given as follows [22] :
where k 0 is the material's conductivity value at temperature 300 K and η is a constant for the specific material. Recalculating the thermal-conductivity value after each iteration for all the elements would be computationally expensive. In order to maintain both accuracy and performance, ISAC uses a postprocessing feedback loop to determine the impact of variations in thermal conductivity on temperature profile. As described in Section IV-A, neglecting the dependence of thermal conductivity on temperature can result in a 5
• C underestimation of peak temperature.
F. Use of ISAC in IC Synthesis
ISAC was developed primarily for use within IC synthesis, although it may also be used to provide guidance during manual architectural decisions. ISAC may be used to solve both steady-state and dynamic thermal analysis problems. For use in steady-state analysis, ISAC requires 3-D chip-package profiles of thermal conductivity and power density. The required IC power profiles are typically produced by a floor planner used within the synthesis process [14] , [23] , [24] . In this application, it produces a 3-D steady-state temperature profile. When used for dynamic thermal analysis, ISAC requires 3-D chippackage profiles of temperature, power density, heat capacity, (optionally) initial temperature, and an elapsed IC duration after which to report results. In this application, it produces a 3-D temperature profile at any requested time.
Both steady-state and dynamic thermal analysis solvers within ISAC have been accelerated using the techniques described in Sections III-C and D in order to permit efficient use after each tentative change to an IC power profile during synthesis or design. Use within synthesis has been validated (see Section IV) by integrating ISAC within a behavioral synthesis algorithm [14] .
IV. EXPERIMENTAL RESULTS
In this section, we validate and evaluate the performance of ISAC. Experiments were conducted on Linux workstations of similar performance. The evaluation focuses on accuracy and efficiency. ISAC supports both steady-state and dynamic thermal analyses. Steady-state thermal analysis is validated against COMSOL Multiphysics, a widely used commercial physics modeling package, using two actual chip designs from IBM and the MIT Raw group. Dynamic thermal simulation is validated against a fourth-order adaptive Runge-Kutta method using a set of synthesis benchmarks. Efficiency determines whether using adaptive thermal analysis during synthesis and design is tractable. To characterize the efficiency of ISAC, we compare it with alternative thermal analysis methods by conducting steady-state and dynamic thermal analyses on the power profiles produced during IC synthesis.
In all cases, convection was modeled as the thermal resistance from the top layer of the heatsink to the ambient. The thermal resistance along the heat-convection path can be estimated as follows:
where h c is the convection heat-transfer coefficient, which is a function of air flow rate. A s is the effective surface area of the heat sink.
A. Steady-State Thermal Analysis Results
This section reports the accuracy and efficiency of the steadystate thermal simulation techniques used in ISAC. We first conduct the following experiments using two actual chip designs: The first IC is designed by IBM. The silicon die is 13 × 13 × 0.625 mm, which is soldered to a ceramic carrier using flipchip packaging and attached to a heat sink. An 11 × 11 block static power profile was produced using a power simulator. The second IC is a chip-level multiprocessor designed by the MIT Raw group. This IC contains 16 on-chip MIPS processor cores organized in a 4 × 4 array. The die area is 18.2 × 18.2 mm. It uses an IBM ceramic column grid array package with direct lid attach thermal enhancement. The static power profile is based on data provided in the literature [25] . We validate ISAC by comparing its results with those produced by COMSOL Multiphysics, a widely used commercial 3-D finite-element-based physics-modeling package. Table I provides thermal analysis results produced by ISAC and COMSOL Multiphysics for these ICs.
Average error e avg will be used as a measure of difference between thermal profiles and is given by
where E is the set of elements on the surface of the active layer of the silicon die modeled by ISAC. T e and T e are the temperatures of element e reported by COMSOL Multiphysics (FEMLAB) and ISAC, respectively. Percentage error is computed with the fixed point of 25
• C, i.e., the ambient temperature, instead of 0 K. This is conservative; if comparisons were made relative to 0 K instead of 298.15 K, the reported percentage error would be substantially lower. Table I , the second and third columns show the peak and average temperatures of the surface of the active layer of the silicon dies of these ICs, as reported by ISAC. Compared to COMSOL, the average errors e avg are 3.1% and 1.0%. The next four columns show the efficiency of ISAC in terms of CPU time, speedup, memory use, and number of elements. For comparison, the next three columns show the efficiency of a multigrid analysis technique with homogeneous meshing. These results clearly demonstrate that element resolution adaptation allows ISAC to achieve dramatic improvements in efficiency compared to the conventional multigrid technique. ISAC achieves speedups of 27.50× and 690.00× relative to an efficient but homogeneous element-partitioning approach. Memory usage decreases to 5.6% and 2.4% of that required by the homogeneous technique. Note that multigrid steadystate analysis itself is a highly efficient approach [6] . Using COMSOL, both simulations take at least 20 min.
Existing academic IC thermal analysis tools neglect the dependence of thermal conductivity on temperature, potentially resulting in substantial errors in peak temperature. In previous work, this error was not detected during validation because the models against which they were validated also used constant values for thermal conductivity. Temperature varies through the silicon die. Therefore, ignoring the dependence of thermal conductivity on temperature may introduce significant errors. As described in Section III-E, ISAC supports modeling of temperature-dependent thermal conductivity. The last two columns of Table I show the peak and average temperatures reported by COMSOL Multiphysics when the thermal conductivity at 25
• C, i.e., room temperature, is assumed. It shows that for both chips, the peak temperatures are underestimated by approximately 5
• C. This effect will be even more serious in designs with higher peak temperatures. Note that the source of inaccuracy is not the specific value of the thermal conductivity chosen. No constant value will allow accurate results, in general. An accurate IC thermal model must consider the dependence of silicon thermal conductivity on temperature.
To further evaluate its efficiency, we use ISAC to conduct thermal analysis for the behavioral synthesis algorithm described in Section II. This iterative algorithm does both behavioral-level and physical-level optimization. In this experiment, ISAC performs steady-state thermal analysis for each intermediate solution generated during synthesis of ten commonly used behavioral synthesis benchmarks. Table II shows the performance of ISAC when used for steady-state thermal analysis during behavioral synthesis. The second, third, and fourth columns show the overall CPU time, speedup, and average memory used by ISAC to conduct steadystate thermal analysis for all the intermediate solutions. Column five shows the average error compared to a conventional homogeneous meshing multigrid method, the overall CPU time, and the average memory use, which are shown in columns six and seven. ISAC achieves almost the same accuracy with much lower run time. The last column shows the CPU time used by the behavioral synthesis algorithm. Comparing columns two and seven makes it clear that when used for steady-state thermal analysis, ISAC consumes only a fraction of the CPU time required for synthesis. It is feasible to use ISAC during synthesis.
B. Dynamic Thermal Analysis Results
In this section, we evaluate the accuracy and efficiency of the dynamic thermal analysis techniques used in ISAC. Heterogeneous spatial-resolution adaptation was evaluated via steady-state thermal analysis. We will now focus on evaluating the proposed temporal-adaptation technique. We apply this technique to second-order (second-order ISAC) and third-order (ISAC) numerical methods, which are then used to conduct dynamic thermal analysis on power profiles produced by our thermal-aware behavioral synthesis algorithm during optimization. Efficiency and accuracy are compared with a fourth-order adaptive Runge-Kutta method, which uses global temporal adaptation. The CPU time of ISAC is also compared to the CPU time for IC synthesis.
We use the same set of benchmarks described in the previous section. To generate dynamic power profiles, online power analysis is conducted during synthesis using a switching activity model proposed in the literature [26] . For each architectural unit, input data with Gaussian distribution are fed through an autoregression filter to model the dependence of switching activity and therefore power consumption on operand bit position. Table III shows the experimental results for dynamic thermal analysis. For each benchmark, columns two and five show the CPU time used by second-order ISAC and ISAC, respectively, to conduct dynamic thermal analysis for all the intermediate solutions generated by the behavioral synthesis algorithm. Column eight shows the CPU time used by the fourth-order adaptive Runge-Kutta method. In comparison, our techniques consistently speed up analysis by at least two orders of magnitude, as indicated by columns four and seven. As shown in columns three and six, ISAC produces results that deviate from those of the adaptive fourth-order Runge-Kutta method by no more than 0.05%. The last column shows the CPU time used by the behavioral IC synthesis algorithm. As indicated in the table, it would be impractical to use the adaptive fourth-order Runge-Kutta method within IC synthesis due to its high computational overhead. The CPU times required by the proposed thermal analysis techniques are similar to those required by IC synthesis. Therefore, it is practical to incorporate them within IC synthesis.
V. CONCLUSION
This paper has presented spatially and temporarily adaptive techniques for steady-state and dynamic thermal analyses during IC synthesis and design. The proposed techniques were used on a number of IC designs and synthesis test cases. They were validated against COMSOL, which is a reliable commercial finite-element physical-process modeling package, and high-resolution spatially and temporally homogeneous solvers. Dynamic spatial and temporal adaptations result in 21.64-690.00× and 122.81-337.23× speedups, respectively, while preserving accuracy. Moreover, improvements in the underlying model, e.g., considering the dependence of thermal conductivity on temperature, have allowed accuracy improvements of 5
• C when compared with IC thermal models that neglect this dependence. The proposed techniques make accurate dynamic and static thermal analysis practical within the inner loops of IC synthesis algorithms. They have been implemented as a software tool called ISAC, which will be publicly released [13] .
