Abstract-With the scaling of CMOS technologies, the gap between nominal supply voltage and threshold voltage has decreased significantly. This trend is further amplified in low-power nanometer libraries, which feature cells with identical size and functionality, but different threshold voltages. As a consequence, different cells may have different delay behaviors as the temperature varies within a circuit. For instance, cells with low-threshold devices may experience an increase in delay when temperature increases, whereas cells using high-threshold devices may experience the opposite behavior. The latter effect, also known as inverse temperature dependence (ITD), poses new challenges to circuit designers. Besides making timing analysis more difficult, ITD has important and unforeseeable consequences for power-aware logic synthesis. This paper describes the impact that ITD may have on the design of nanometer circuits. We also provide a threshold voltage assignment algorithm for dual threshold voltage synthesis, which guarantees temperature-insensitive operation of the circuits, together with a significant reduction of both leakage and total power consumption. Experiments performed on a set of standard benchmarks show timing compliance at any operating temperature, and an average leakage reduction around 28% compared to circuits synthesized with a standard synthesis flow that does not take ITD into account. We also apply our proposed synthesis algorithm to a realistic case study consisting of a 32-bit, IEEE-754 floating point unit.
I. INTRODUCTION
FOR MANY YEARS, the miniaturization of MOS devices provided unprecedented benefits in the form of higher functional density, higher chip performance, and lower IC cost. However, as hundreds of millions of transistors began to be integrated in a single monolithic die, power dissipation and thermal management became major concerns. The increased transistor and current density has raised total power consumption and power per unit area tremendously. Moreover, at nanometer dimensions, static power (i.e., leakage) grows exponentially and is now comparable to dynamic power [1]. Since the power consumed by ICs is converted into heat, heat densities rise with power densities, resulting in very high operating temperatures. Furthermore, local hot spots, with temperatures much higher than the average die temperature, may occur.
High temperatures and large temperature fluctuations have been shown to have a negative effect on power dissipation, reliability, and fabrication cost. Factors influencing leakage current become more prominent at higher operating temperatures [1]. For instance, at higher temperatures, Fermi-level lowering and increased subthreshold conductance cause subthreshold currents to increase, thereby increasing static power dissipation and eventually leading to even higher operating temperatures. Operating at such high temperatures can seriously impact the reliability of circuits [2]. Furthermore, the operating lifetime of devices and interconnects may be reduced by a number of thermally induced physical effects, such as negative bias temperature instability, dielectric breakdown, and electromigration. Thus, unless the generated heat is removed at a rate equal to or greater than its rate of generation, the mean time to failure (MTTF) will drop to unacceptable levels. Needless to say, maintaining a lower operating temperature requires complicated packaging and cooling techniques, which drastically increase the fabrication cost of the chip temperature around 100 .
In addition to cost, reliability, and heat dissipation issues, higher temperatures imply that a chip must now operate under a wider temperature range. With wider ranges, the performance of the circuit may vary significantly, although not all elements are affected in the same way. Concerning wires, as on-chip temperature rises, the metal resistivity of the global interconnects increases [3], leading to significant performance degradation. As a result, designers often build in extra slack along these interconnect paths in order to guarantee timing compliance even at high temperatures. Short wires may also be affected by temperature fluctuations; however, their delay variations are much smaller and, since these wires are short and localized, they are all affected in the same way, so there is no gradient effect as with long wires.
Concerning devices, recent works have shown that, in sub-90 nm technologies, the propagation delay through a gate (or cell) can vary in complex ways as temperature varies. That is, depending on various parameters, such as cell size, load, supply voltage ($V_{dd}$), and threshold voltage ($V_{th}$), the delay can either decrease or increase as temperature rises [9].
1063-8210/$26.00 © 2009 IEEE

Traditionally, it has been assumed that cell delay degrades as temperature rises. Therefore, the slow corner of a cell could be verified at high temperature (i.e., worst case), and the fast corner could be verified at low temperature. However, in low-power nanometer-scale technologies, this assumption no longer holds, and instead, an inverse temperature dependence (ITD) phenomenon may appear, such that cell delay may decrease as temperature increases [5], [6]. While the ITD effect has been known for quite some time, especially in analog MOS circuits, only recently has this phenomenon been addressed explicitly in the design phase. For instance, Kursun et al. [5] propose a design methodology that optimizes the supply voltage to achieve temperature-variation-insensitive standard cells. By using a supply voltage larger than the nominal one (around 15%-35% higher), minimum delay variation is guaranteed along the entire range of operating temperatures. Other works develop techniques to bound cell and path delays in order to obtain more accurate static timing analysis while taking into account the ITD effect [6], [7]. Finally, detailed ITD models have been developed in [8] and used to capture the impact of ITD on repeater insertion. While all these approaches are useful, they do not provide any practical solution to address the synthesis challenges posed by ITD.
By characterizing cell delays using an industrial low-power 65-nm multi-$V_{th}$ technology, two simple rules of thumb for cell behavior can be obtained: the delay of fast, low-$V_{th}$ (LVT) cells increases as temperature increases (i.e., the "classical" assumption), while the delay of high-$V_{th}$ (HVT) cells decreases as temperature increases. In other words, the worst-case delay of an LVT cell occurs at high temperature, while, for an HVT cell, the worst-case delay is at the lowest temperature. This added complication poses several limitations to timing-driven synthesis tools that do not consider temperature as an explicit variable in their optimizations. For instance, the common practice has been to synthesize designs using cell libraries characterized for the worst case (highest temperature, $T_{max}$). However, since new technologies exhibit an ITD effect, high temperature does not necessarily correspond to the worst-case condition, so a circuit may have timing faults when operating at other temperatures. To compensate for these temperature effects, the designer may be conservative when setting the timing constraints (i.e., setting the desired timing to be faster than actually required). In this case, even if the path delays increase for temperatures other than $T_{max}$, one can guarantee that the global delay constraint is still met; however, this comes at the cost of higher power dissipation and/or circuit area.
To address these limitations of the synthesis tools, in this paper, we present a new temperature-aware dual-$V_{th}$ synthesis methodology [9], [10]. The proposed synthesis flow takes as a starting point an overconstrained circuit synthesized with a commercial tool and, through a temperature-driven cell replacement process, generates a leakage-optimized temperature-aware netlist. The entire flow has been developed according to standard, industry-strength tools and libraries. We show that our approach is able to automatically synthesize temperature-invariant designs that meet the given timing constraints over the entire range of allowed temperatures, while significantly reducing leakage power compared to circuits synthesized with a standard, non-ITD-aware commercial tool. Unlike many existing dual-threshold optimization techniques, our approach to $V_{th}$ assignment does not involve other kinds of gate/circuit modifications (e.g., gate resizing); it only applies a postlayout selective cell swapping from LVT to HVT. As such, it can be seen as orthogonal to other leakage optimization solutions.
The paper is organized as follows. Section II provides some background on the temperature-induced effects on CMOS devices and the issues related to multi-$V_{th}$ design. Section III presents the analytical derivation of the ITD point, as well as details on how it can be characterized. Section IV describes the proposed thermal-aware multi-$V_{th}$ synthesis methodology, whose effectiveness is discussed in Section V. Finally, Section VI draws a few conclusions and perspectives for future research.
II. BACKGROUND AND MOTIVATION

A. Temperature-Dependent MOSFET Drain Current Modeling
The alpha-power model [11] is commonly used to express the on-current $I_{on}$ of a short-channel MOSFET device. The expression for $I_{on}$ depends on three parameters that are strongly dependent on temperature: threshold voltage $V_{th}$, mobility in the linear region $\mu$, and saturation velocity in the saturation region $v_{sat}$. Their temperature dependence can be expressed as follows [12]:
where $T_0$ is the reference temperature (typically 300 K), $V_{th}(T_0)$, $\mu(T_0)$, and $v_{sat}(T_0)$ are the threshold voltage, mobility, and saturation velocity at the reference temperature, respectively, and $\kappa$ is the threshold voltage temperature coefficient (measured to be around 0.8 mV/K). The values $m$ and $\eta$ are technology-dependent temperature coefficients; $m$ is ideally 1.5, but it can reach 1, depending on the process, while $\eta$ is extracted to be around 150 m/(s·K). From (1)-(3), it is evident that all three parameters decrease linearly as temperature increases; however, they affect the drain current in different ways. In particular, lower $\mu$ (while operating in the linear region) or lower $v_{sat}$ (while operating in the saturation region) causes the drain current to decrease; on the other hand, lower $V_{th}$ (in either region) causes the current to increase. The end result is that the drain current will show a direct or inverse temperature dependence, depending on the dominant parameter.
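The expressions referenced as (1)-(3) do not appear in the text above. As a hedged reconstruction, the forms most commonly used in the literature for these three dependences are given below. The symbol names $\kappa$, $m$, and $\eta$, the pairing of each equation number with each parameter, and the power-law form of the mobility term are assumptions rather than a transcription of the original equations.

```latex
\begin{align}
V_{th}(T)  &= V_{th}(T_0) - \kappa\,(T - T_0) \tag{1} \\
\mu(T)     &= \mu(T_0)\left(\frac{T}{T_0}\right)^{-m} \tag{2} \\
v_{sat}(T) &= v_{sat}(T_0) - \eta\,(T - T_0) \tag{3}
\end{align}
```

Under these forms, (1) and (3) are exactly linear in $T$, while (2) is a power law that is close to linear over a range of roughly 100 K around $T_0$.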
B. Multithreshold Voltage Design
Multithreshold voltage (MVT) is a power optimization technique that allows transistors with two different threshold voltages to be fabricated on the same chip (for example, see [13] and [14] ). The idea is to allow transistors on noncritical paths to be fabricated with a higher threshold voltage, so as to reduce the subthreshold leakage current.
While it is one of the most effective static techniques for active-mode leakage reduction, MVT poses new challenges in the field of circuit synthesis. For a given circuit netlist, synthesis tools must find the best $V_{th}$ assignment for each gate in a circuit, so as to minimize the leakage power while observing all timing constraints. This optimization problem is quite complex. For a circuit with $N$ gates and a dual-$V_{th}$ technology process, there are $2^N$ possible assignments for the threshold voltages of the gates. Given the large number of gates that can be integrated on a chip, a complete design space exploration would require unacceptable computation times, thus necessitating the use of heuristics. The latter include those that perform simultaneous dual-$V_{th}$ assignment and transistor/gate sizing (e.g., [15]-[24]), or those that combine dual-$V_{th}$ and gate sizing with other optimization techniques, such as dual-$V_{dd}$ and dual-$T_{ox}$ assignment, to reduce subthreshold current, gate current, and dynamic power simultaneously while achieving timing closure [25]-[27]. In addition, a number of recent works have proposed algorithms for dual-$V_{th}$ synthesis in the presence of process-induced variation [28]-[31].
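To make the size of this search space concrete, the following minimal sketch enumerates every LVT/HVT assignment for a toy single-path circuit and keeps the lowest-leakage one that meets a delay bound. The function name, the per-gate delay and leakage tables, and the assumption that path delay is a plain sum of gate delays are all hypothetical and serve only as illustration.

```python
from itertools import product

FLAVORS = ("LVT", "HVT")


def exhaustive_assignment(delays, leaks, d_max):
    """Try all 2**N assignments; return (leakage, assignment) or None.

    delays, leaks: one dict per gate, mapping flavor -> value (hypothetical).
    The circuit is modeled as a single path, so delays simply add up.
    """
    best = None
    for combo in product(FLAVORS, repeat=len(delays)):
        delay = sum(gate[v] for gate, v in zip(delays, combo))
        if delay > d_max:
            continue  # assignment violates the timing constraint
        leakage = sum(gate[v] for gate, v in zip(leaks, combo))
        if best is None or leakage < best[0]:
            best = (leakage, combo)
    return best
```

The loop body runs $2^N$ times, which is already about $10^{30}$ iterations for $N = 100$ gates; this is why heuristics are needed for real designs.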
In this paper, we address the issue of MVT design under environmental variations that manifest as a change in temperature with respect to the nominal one. We have developed a novel dual-$V_{th}$ synthesis flow, which takes into account the temperature dependence of devices and yields low-leakage circuits that are temperature-insensitive.
Our contribution, however, is not a new MVT assignment algorithm, but rather a postprocessing step that is executed on an existing dual-$V_{th}$ circuit obtained by any algorithm, and that swaps HVT and LVT cells to enforce temperature insensitivity. This choice was driven by the objective of maximum compatibility with standard design flows; as a matter of fact, our swapping algorithm operates on a dual-$V_{th}$ netlist obtained by using the MVT assignment algorithm from Synopsys PowerCompiler.
Our approach is therefore closer to pure "swapping" solutions, such as [32] - [34] , which rely on an initial multithreshold solution that is eventually perturbed to achieve the desired point in the leakage/delay space.
III. IMPACT OF INVERTED TEMPERATURE DEPENDENCE
A. MOSFET Drain Current
Recall from the discussion in Section II-A that the drain current can either increase or decrease as temperature rises, depending on which of the three parameters, i.e., carrier mobility, saturation velocity, and threshold voltage, dominates. For a given technology and threshold voltage, the operating supply voltage is the primary factor in determining which of these parameters will dominate. When the supply voltage is significantly higher than $V_{th}$, the reduction in mobility and saturation velocity dominates, thus leading to a reduction in drain current when temperature increases. However, when the supply voltage is closer to $V_{th}$, the channel control voltage becomes more sensitive to $V_{th}$ variation, thus leading to an increase in drain current when temperature increases. The supply voltage at which no parameter dominates is known as the zero temperature coefficient (ZTC) voltage, $V_{ZTC}$, and, by definition, it is the gate-to-source voltage that makes MOS devices insensitive to temperature variations [4].$^1$ The ZTC voltage can be found analytically by calculating the voltage value that yields $\partial I_{on} / \partial T = 0$. Clearly, the linear and saturation regions will have different values of $V_{ZTC}$. Since nanometer-scale devices operate in the saturation region for the majority of the time, all the following analyses refer to the ZTC voltage in this region.
What matters in this context is the temperature behavior of the device with respect to $V_{ZTC}$: if $V_{dd} > V_{ZTC}$, then the drain current decreases as temperature increases, while if $V_{dd} < V_{ZTC}$, then the drain current increases as temperature increases.
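As a rough numerical illustration of the ZTC point, the sketch below evaluates a simplified saturation-region current of the form $v_{sat}(T)\,(V_{gs} - V_{th}(T))^{\alpha}$ at two temperatures and bisects for the gate voltage where the two currents coincide. The model form, the function names, and every numeric constant (including the two example threshold voltages) are assumptions chosen for demonstration; they are not values from the 65-nm library discussed in the paper.

```python
# Illustrative saturation-region current model; all numeric values below are
# assumptions for demonstration, not data from the 65-nm library.
T0 = 300.0      # reference temperature (K)
KAPPA = 0.8e-3  # assumed threshold-voltage temperature coefficient (V/K)
ETA = 150.0     # assumed saturation-velocity temperature coefficient (m/s per K)
VSAT0 = 1.0e5   # assumed saturation velocity at T0 (m/s)
ALPHA = 1.3     # assumed alpha-power exponent


def on_current(v_gs, temp_k, vth0):
    """Relative saturation current: v_sat(T) * (V_gs - V_th(T))**alpha."""
    vth = vth0 - KAPPA * (temp_k - T0)
    vsat = VSAT0 - ETA * (temp_k - T0)
    overdrive = max(v_gs - vth, 0.0)
    return vsat * overdrive ** ALPHA


def ztc_voltage(vth0, t_cold=298.15, t_hot=398.15, v_max=2.0, tol=1e-6):
    """Bisect for the gate voltage where hot and cold currents are equal."""
    def diff(v):
        return on_current(v, t_hot, vth0) - on_current(v, t_cold, vth0)

    lo = vth0 - KAPPA * (t_cold - T0)  # cold device just turns on here
    hi = v_max
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no ZTC crossing in the searched voltage range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) > 0:
            lo = mid  # current still rises with temperature (ITD side)
        else:
            hi = mid  # current falls with temperature (direct side)
    return 0.5 * (lo + hi)
```

With these assumed constants, a device with a 0.45 V reference threshold ends up with a ZTC voltage slightly above 1.0 V, while a 0.30 V threshold gives one below 1.0 V, which mirrors the qualitative HVT versus LVT distinction made in the text.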
In Fig. 1, we used SPICE to plot the drain current of an LVT NMOS device as a function of $V_{gs}$ for three different operating temperatures. We used a commercial low-power 65-nm technology with . As shown in the figure, the simulated value for $V_{ZTC}$ is found to be 0.626 V. The aforementioned analysis refers to the behavior of a "nominal" device. In real technologies, where variations of process parameters, such as body doping concentration, gate oxide thickness, and channel length, must be taken into account, every device comes in different configurations, corresponding to specific process corners. Most technology libraries feature three corners (fast, typical, and slow). Since these process parameters affect the value of the threshold voltage, and, in turn, the value of $V_{ZTC}$, it is important to evaluate $V_{ZTC}$ for different corners. Table I reports $V_{th}$ and $V_{ZTC}$ values for LVT, standard-$V_{th}$ (SVT), and HVT NMOS and PMOS devices for the three different process corners supported by our technology library: fast process (FP), typical process (TP), and slow process (SP).
Since $V_{ZTC}$ is always larger than the given $V_{th}$, the ITD effect can appear during strong inversion. Also, the better the process yield, the smaller the $V_{ZTC}$ value (i.e., the fast corner has the lowest $V_{ZTC}$). Finally, due to asymmetric threshold voltages, the $V_{ZTC}$ value for PMOS devices is larger than that of NMOS.

$^1$ More precisely, it is the gate-to-source voltage that minimizes the temperature-induced delay variation. In fact, since $V_{th}$ and $v_{sat}$ have different temperature coefficients, zero variation is not achievable.
In this paper, we assume a nominal supply voltage of 1.0 V. This implies that, regardless of the process corner, LVT and SVT NMOS devices will have a direct temperature dependence ($V_{dd} > V_{ZTC}$), HVT NMOS and LVT PMOS devices will operate close to the $V_{ZTC}$ point, and SVT and HVT PMOS devices will have an inverse temperature dependence (i.e., operate in the ITD region). Having different temperature dependencies even when operating at a single supply voltage can greatly complicate gate delay characterization. Consider a simple inverter made up of HVT transistors for a TP corner. Since NMOS transistors show an inverse temperature dependence, propagation delay for an output high-to-low transition shows an inverse temperature dependence as well. However, due to the direct temperature dependence of PMOS transistors, the low-to-high propagation delay reduces as temperature increases. As we will show in the next section, capacitive load and input slew rate can shift the equivalent $V_{ZTC}$ of the cell as well.
B. From Devices to Gates
The delay of a generic standard cell can be expressed by the well-known approximation provided by the alpha-power model [11]:

$$t_p \propto \frac{C_L V_{dd}}{I_{on}} \qquad (4)$$

where $C_L$ represents the total load, $V_{dd}$ is the supply voltage value, and $I_{on}$ is the (dis)charge current that flows from the output node to one of the supply rails through the internal transistors. Since the temperature dependence of the propagation delay is due to the active current of the internal transistors ($I_{on}$), $t_p$ can also show an inverse or direct dependence.
As for single transistors, we can define a supply voltage, $V_{ZTC}$, at which the cell is temperature insensitive, namely, the temperature sensitivity of its propagation delay is almost zero. For $V_{dd}$ below (above) $V_{ZTC}$, $I_{on}$ shows an inverse (direct) temperature dependence, namely, it increases (decreases) with temperature, and the cell becomes faster (slower) at 125 °C. It is of primary importance to observe that $V_{ZTC}$ is not a technological parameter. It can vary in different ways, depending on the operating conditions. This is an important issue that has to be taken into account if we want to exploit this phenomenon. Aside from spatial and temporal drifts, which can alter the nominal electrical parameters of MOS devices, $V_{ZTC}$ also varies depending on the logic function of the cell. Furthermore, the $V_{ZTC}$ of a cell depends on which inputs are switching and how, the kind of output transition (low-to-high or high-to-low), and several circuit variables such as the slope of the input signals and the load capacitance.
As a summary of this complex set of interactions, Fig. 3 plots the $V_{ZTC}$ of a single minimum-sized four-input NAND in a 65-nm technology (SP corner). More specifically, the three plots (from top to bottom) show the $V_{ZTC}$ value for the LVT, SVT, and HVT cell configurations as a function of the input signal slew rate, SlewRate, and the load capacitance, $C_L$. The ranges of these two independent variables have been selected according to the data sheet provided by the library provider. Let us first consider the middle plot (NAND4X4_SVT). The voltage at which the cell is temperature insensitive decreases with the input slew rate, namely, the steeper the input slope, the smaller the $V_{ZTC}$. At the same time, the larger the load, the smaller the $V_{ZTC}$. In other words, the $V_{ZTC}$ of a cell with fast input transitions and heavy load is smaller than the $V_{ZTC}$ of a cell with slower input transitions and smaller load. Since the nominal supply voltage for our library is 1.0 V, the plot shows that the SVT NAND gate can either get slower or faster as temperature increases, depending on load and slew rate. Conversely, for HVT and LVT cells, two fundamental facts can be observed from the plots: 1) the $V_{ZTC}$ of an HVT cell is always above the nominal supply voltage (1.0 V) and 2) the $V_{ZTC}$ of an LVT cell is always below the nominal $V_{dd}$. This holds regardless of load and slew rate values. This implies that all the LVT cells operate on the right side of $V_{ZTC}$ and get slower at higher temperature, while all the HVT cells work in the ITD region and become faster at high temperature.
In order to show that the aforementioned behavior applies not just to a single type of cell, Fig. 4 shows the normalized propagation delay for three minimum-sized gates (INV, NAND3, and NOR2) as a function of temperature for the three different threshold voltages (LVT, SVT, HVT) provided by our library. The characterization was made for the nominal $V_{dd}$ (1.0 V) and an equivalent fan-out load capacitance of 1. For LVT cells, the mobility effect dominates and the propagation delays increase with temperature (around 3% degradation for a 100 °C temperature swing). Conversely, for the HVT case, the cells are more sensitive to threshold voltage lowering, and we observe a 6% performance improvement (ITD). Standard gates (SVT lines) show a nonmonotonic dependence. At lower temperatures, the mobility effect dominates, while, at high temperatures (higher than 80 °C), the $V_{th}$ dependence makes the gates slightly faster (less than 1%). Other cells exhibit similar trends. Although the magnitude of the delay variations may appear to be small, it is worth emphasizing that these variations are expected to become more evident in future technology nodes, due to lower $V_{dd}$ values and more skewed values of low and high threshold voltages.
In the next section, we will describe the deleterious effects that ITD may cause in MVT circuits when they are synthesized using non-temperature-aware synthesis tools.
C. From Gates to Circuits
The main consequence of the contrasting behavior between cells having different $V_{th}$ is that traditional design methodologies, in which synthesis is performed for the high-temperature worst case, may result in timing failures, since the actual worst case for delay can occur at a temperature different from the highest (or lowest). Since the synthesis was done at 125 °C, all the paths meet the timing constraints at high temperature (all lines below the bound); however, due to ITD, the same cannot be guaranteed at room temperature. As shown in Fig. 5, the critical path delay decreases at lower temperatures (mobility effects dominate in LVT cells), while the noncritical paths, which are mapped with HVT cells (threshold voltage effect dominates), get slower, thus exceeding the timing constraints (see the line of the all-HVT path). This timing fault can generate latching-error and metastability issues. Note that, due to the intrinsic nature of the cells, monotonicity is guaranteed along the entire temperature range for the all-LVT and all-HVT paths. For paths consisting of a mixed distribution of cells, three different conditions may occur. If the amount of delay introduced by HVT cells is larger than that of the other cells, then the HVT cells impose their behavior; the path gets slower at 25 °C and a timing fault can appear. In contrast, if the LVT cells dominate, then the path delay decreases as temperature decreases, and the functionality is guaranteed for any temperature. Finally, if the SVT cells impose their behavior, then the worst case may appear at some intermediate temperature (see the line of the mixed path), causing a nonmonotonic behavior in path delay.
However, note that MVT synthesis algorithms implemented in commercial tools typically work with the assignment of LVT and HVT cells only, because most technology libraries support only two threshold voltage values. When an additional threshold voltage (like the SVT in our library) is available, it is mainly used as a default value in the algorithms for gates that are assigned neither to HVT nor to LVT. Leveraging this fact and considering the nonmonotonic behavior of SVT cells versus temperature, we decided to restrict ourselves to a dual-$V_{th}$ synthesis, where only LVT and HVT cells are used. This guarantees that the actual worst-case delay can appear only at one of the two boundaries of the temperature range (i.e., $T_{min}$ or $T_{max}$), namely, the propagation delay of a dual-$V_{th}$ path is still monotonic.
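This property can be easily proved by assuming that the temperature dependence of the propagation delay of a single gate can be approximated using a first-order linear equation, i.e., $d(T) = d_0 + k\,(T - T_{min})$, where $d_0$ is the propagation delay at the minimum temperature $T_{min}$ and $k$ defines the slope. HVT cells will have $k < 0$, whereas LVT cells will have $k > 0$. If only HVT and LVT cells are allowed, the delay on a path is the sum of linear terms, i.e., a linear combination. A sum of linear functions of $T$ is itself linear in $T$, hence monotonic, so its maximum over the temperature range is attained at one of the two boundaries.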
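The boundary argument can be checked numerically with a minimal sketch of the first-order model. The gate parameters below are hypothetical: the slopes are chosen to roughly match the 3% (LVT) and 6% (HVT) variations over 100 °C quoted earlier, and the function names are illustrative only.

```python
T_MIN, T_MAX = 25.0, 125.0  # temperature range boundaries (deg C)


def gate_delay(d0, slope, temp):
    """First-order model: delay at T_MIN plus a linear temperature term."""
    return d0 + slope * (temp - T_MIN)


def path_delay(gates, temp):
    """Path delay is the sum of the linear delays of its gates."""
    return sum(gate_delay(d0, k, temp) for d0, k in gates)


def worst_case_corner(gates):
    """The summed slope decides which boundary gives the largest delay."""
    total_slope = sum(k for _, k in gates)
    return T_MAX if total_slope > 0 else T_MIN


# Hypothetical gates: (delay in ps at T_MIN, slope in ps per deg C).
LVT_GATE = (100.0, 0.03)    # about +3% over 100 deg C
HVT_GATE = (130.0, -0.078)  # about -6% over 100 deg C
```

For a path made of three HVT gates and one LVT gate, the summed slope is negative, so the worst case sits at 25 °C; a path timed only at 125 °C would therefore look faster than it is at room temperature.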
IV. TEMPERATURE-INSENSITIVE DUAL-Vth ASSIGNMENT
The basic idea behind our temperature-aware MVT synthesis is best explained with the help of Fig. 6, which illustrates the (delay, leakage) design space of the dual-$V_{th}$ assignment problem. For a given circuit netlist, we assume that the gate-level implementation is fixed. Each possible threshold voltage configuration can be represented as a (delay, leakage) coordinate point. The all-LVT point indicates the delay/leakage coordinates when only LVT cells are used in the circuit. This point corresponds to minimum delay at the cost of highest leakage power; the all-HVT solution (when only HVT cells are used), conversely, minimizes the leakage power at the cost of performance. These two points delimit the feasible solution space (the thick bounding box).
Let $D_{max}$ be the delay constraint specified by the designer; only the portion of the feasible area to the left of $D_{max}$ is considered timing compliant (the darker rectangular region). As discussed earlier, commercial MVT synthesis tools do not take into account the ITD effect; instead, they typically assume worst-case conditions to be at the highest allowable operating temperature (i.e., $T_{max}$). Therefore, while the worst-case delay of the synthesized circuit may be within the specified value when operating at $T_{max}$, there is no guarantee that this nominal delay is satisfied for the entire range of allowable operating temperatures. This situation is illustrated by the solid circle in the center of the graph, which depicts the range of solutions when the circuit is synthesized with a commercial tool with nominal delay set to $D_{max}$. Some solutions may violate the timing constraint because the tool did not take into account that the worst-case delay might occur at some temperature other than $T_{max}$. Knowing that the inverse temperature dependence is possible, a designer might try to compensate for the ITD effect by overconstraining the synthesis of the circuit, setting the target delay to be less than $D_{max}$. The range of solutions obtained by taking this approach is illustrated by the dashed circle to the left of and above the original solution space in Fig. 6. The idea is that even if the path delays increase for temperatures other than $T_{max}$, there is enough slack in the design such that the real nominal delay is still met. Unfortunately, this overconstrained approach also tends to generate circuits that are shallower and wider, thereby consuming more area and leakage power. To better address these issues, our approach takes into account the temperature inversion effect as part of the dual-$V_{th}$ synthesis process, so that timing constraints are met by construction at any operating temperature.
We illustrate the range of solutions obtained by taking our approach by the dashed circle to the left of and below the original solution space in Fig. 6. We note that it is possible that our solution will yield a circuit with slightly higher leakage current than the original one synthesized at $T_{max}$; however, it will be guaranteed not to violate delay constraints, which the original circuit is often not able to do. The following section will present our dual-$V_{th}$ synthesis flow and algorithm in more detail.
A. ITD-Aware Dual-Vth Synthesis Flow
Our proposed synthesis flow is illustrated in Fig. 7 . The flow has been set up with the objective of achieving maximum compliance with existing commercial tools.
We first synthesize the target circuit using the nominal timing constraint $D_{max}$ and standard dual-$V_{th}$ libraries characterized at high temperature (i.e., 125 °C). Synthesis at high temperature guarantees worst-case parasitic extraction. We then estimate the worst-case delay of the circuit at both 125 °C and 25 °C using static timing analysis. Next, we resynthesize the circuit using the same libraries, but with a tighter timing constraint (i.e., $D'_{max}$, where $D'_{max} < D_{max}$). We choose $D'_{max}$ to be small enough such that the new circuit is timing compliant at both 125 °C and 25 °C (i.e., worst-case delay is less than $D_{max}$). We note that the value of $D'_{max}$ is circuit-dependent and that the larger the ITD effect, the smaller the value of $D'_{max}$ must be.
At this point, we have a solution that is compliant from a timing viewpoint, but it represents an upper bound for the leakage optimization problem. Using our temperature-insensitive dual-$V_{th}$ assignment algorithm, we attempt to recover some of this leakage power by searching for an optimal threshold voltage assignment for all the cells in the circuit.
B. Dual-Vth Assignment Algorithm
As discussed in Section II-B, an exhaustive exploration of all possible dual-$V_{th}$ assignments would require unacceptable computational time. Rather, we propose an iterative path-based heuristic, where, at each iteration, only a selected subset of cells is examined for $V_{th}$ assignment. More specifically, the subset is composed of cells in the critical path. Several assignments are explored, and the one that makes the current path timing compliant at both temperature boundary conditions is selected. The algorithm iterates until timing compliance is met for all paths in the circuit.
By considering only a subset of gates, we drastically reduce the complexity of the assignment process and the time required to extract the subset of gates. However, this new assignment may change the critical delay for paths outside this subset of cells, thereby altering the worst-case delay of the full circuit. Therefore, for the next iteration, we must select the next subset of cells based on the updated delay values. If the critical path changes significantly, more iterations will be required before the algorithm converges to a timing-compliant solution. In general, the efficiency of the algorithm strongly depends on the type and size of the selected subset. The pseudo-code for our $V_{th}$-assignment procedure is shown in Algorithm 1. The procedure takes, as input, the timing constraint, $D_{max}$, and the netlist generated from synthesizing the circuit with the tighter timing constraint (i.e., $D'_{max}$). Because this initial circuit has been generated using $D'_{max}$ as timing constraint, we assume the circuit is timing compliant at both 25 °C and 125 °C. The procedure returns the netlist of the optimized circuit.
As a first step, we replace all LVT cells in the circuit with HVT cells (Line 3). This step gives us a lower bound on the leakage power, but most likely violates timing constraints for some paths. We then compute the worst-case delays of the circuit at 25 °C and 125 °C in Line 4. Next, our algorithm enters a while loop in Line 5 and adjusts threshold voltage assignments until the timing constraint is met for all paths. That is, we iterate until the worst-case delays of the circuit are less than or equal to $D_{max}$. At each iteration of the while loop, we extract the set of cells along the worst-case delay path of the circuit (Line 6), and create lists with the names of the cells, their threshold voltages, and flags indicating which cells are locked or unlocked (Lines 7-8). If a cell has been locked, this indicates that it already has been processed in a previous iteration and therefore cannot be modified. An unlocked cell indicates that it still has an HVT assignment and can be potentially swapped to LVT in order to meet timing constraints for the path under consideration.
The cells in the path are now sorted in Line 9 in increasing order of leakage values, and the algorithm enters a loop in Line 12 to iterate the replacement of cells. With each successive iteration of this loop, if there is any unlocked cell remaining in the path, we assign an additional unlocked cell to an LVT value, increment the loop index, and check again for timing violations at both 25 °C and 125 °C. If the new configuration meets the timing constraints, we lock the cells (Line 18) and break out of the loop. Note that since we are using an ordered list, we swap cells from HVT to LVT starting from the one with minimum leakage. In this manner, we try to allow LVT cells to have minimum impact on overall leakage power.
Due to the assignments made in previous path iterations, it is possible that the path being processed cannot meet timing constraints. That is, since cells can be shared by multiple paths, so many cells in the current path may be locked that, even if all the unlocked cells are swapped to LVT, the timing constraints will still not be met. If the current assignments do not lead to a viable solution, the procedure backtrack is called in Line 21, which unlocks a cell contained in the current path that was locked while processing a previous path, but has not yet been assigned to a low-threshold value. This step effectively adds a new cell to the free variable list and allows us to search for a new feasible solution for this path. To make this change effective, the backtracking procedure must also update the path variables (the lists of cell names, threshold voltages, and lock flags). Note that by starting from a timing-compliant solution, we guarantee that the algorithm will always converge to a feasible solution (i.e., all LVT cells in the worst case).
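The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions, not a transcription of Algorithm 1: the circuit is reduced to an explicit list of paths, path delay is the plain sum of per-cell delays at the two corners (no real static timing analysis), and backtracking is simplified to making previously locked HVT cells of the current path available again once the unlocked ones are exhausted. The names `Cell`, `path_delay`, and `assign_vth` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    delay_hvt: tuple  # (delay at 25 C, delay at 125 C) for the high-Vth variant
    delay_lvt: tuple  # same two corners for the low-Vth variant
    leak_lvt: float   # leakage of the LVT variant, used as the sort key


def path_delay(path, cells, vth):
    """Worst delay of one path over both temperature corners (toy additive model)."""
    per_corner = []
    for corner in (0, 1):
        total = 0
        for name in path:
            cell = cells[name]
            delays = cell.delay_lvt if vth[name] == "LVT" else cell.delay_hvt
            total += delays[corner]
        per_corner.append(total)
    return max(per_corner)


def assign_vth(paths, cells, t_max):
    """Greedy HVT-to-LVT swapping along the current worst path."""
    vth = {name: "HVT" for name in cells}  # start from the leakage lower bound
    locked = set()
    while True:
        path = max(paths, key=lambda p: path_delay(p, cells, vth))
        if path_delay(path, cells, vth) <= t_max:
            return vth  # every path meets timing at both corners

        def by_leak(name):
            return cells[name].leak_lvt

        free = sorted((n for n in path if n not in locked and vth[n] == "HVT"),
                      key=by_leak)
        # Simplified backtracking: locked cells that are still HVT become
        # candidates only after all unlocked cells have been tried.
        unlocked_again = sorted((n for n in path if n in locked and vth[n] == "HVT"),
                                key=by_leak)
        candidates = free + unlocked_again
        if not candidates:
            raise ValueError("path violates timing even with all cells LVT")
        for name in candidates:
            vth[name] = "LVT"
            if path_delay(path, cells, vth) <= t_max:
                break
        locked.update(path)


# Hypothetical four-cell example: HVT cells get faster with temperature (ITD),
# LVT cells get slower.
cells = {
    "a": Cell((10, 9), (6, 7), 5.0),
    "b": Cell((10, 9), (6, 7), 2.0),
    "c": Cell((10, 9), (6, 7), 3.0),
    "d": Cell((10, 9), (6, 7), 1.0),
}
paths = [["a", "b", "c"], ["c", "d"]]
result = assign_vth(paths, cells, t_max=24)
```

Because every pass through the outer loop either returns or moves at least one cell from HVT to LVT, the sketch terminates, mirroring the convergence argument in the text.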
V. EXPERIMENTAL RESULTS ON ISCAS BENCHMARKS
A. Physical Implementation Design Flow
In order to validate the proposed methodology, we set up a new synthesis flow based on commercial EDA tools and tested it on a set of benchmarks. Each benchmark was synthesized using Synopsys PhysicalCompiler onto a 65 nm technology library from STMicroelectronics. The synthesis was done using dual-$V_{th}$ libraries, with power and area optimization features enabled. The main advantage of performing a physical synthesis is that the final result is a placed design with a parasitic estimation of the interconnects, which allows accurate timing and power estimations; these were obtained using Synopsys PrimeTime and PrimePower.
In order to obtain a detailed estimation of dynamic power, for each circuit we evaluated the switching activity of the internal nodes; using Mentor Graphics ModelSim, dedicated testbenches were used to emulate actual workloads. By parsing the simulation outputs, we extracted both static probability and number of toggles for each node; these values were then annotated in PrimeTime/PrimePower during power estimation.
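As an illustration of the two quantities extracted for each node, the following sketch computes static probability (fraction of time the node is at logic 1) and toggle count from a list of value-change events. The event format and the function name `switching_activity` are assumptions made for this example; they do not reflect the actual ModelSim output format or the authors' parsing scripts.

```python
def switching_activity(events, t_end):
    """Return (static_probability, toggles) for one node.

    events: list of (time, value) pairs sorted by time, with the first
            event at time 0 and value in {0, 1} (hypothetical format).
    t_end:  total simulated time, in the same unit as the event times.
    """
    toggles = 0
    time_high = 0
    for (t0, v0), (t1, v1) in zip(events, events[1:]):
        if v0 == 1:
            time_high += t1 - t0  # node was at 1 during [t0, t1)
        if v1 != v0:
            toggles += 1
    last_t, last_v = events[-1]
    if last_v == 1:
        time_high += t_end - last_t  # tail of the simulation
    return time_high / t_end, toggles


prob, toggles = switching_activity([(0, 0), (10, 1), (30, 0), (40, 1)], t_end=50)
```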
The tools and flows are based on the use of standard timing libraries characterized at several process/voltage/temperature corners using Cadence SignalStormLC. Since an actual CMOS standard library consists of hundreds of cells, characterization time can be very lengthy; therefore, we decided to use a subset of the library (250 gates out of more than 800). The characterization has been done for low-$V_{th}$ and high-$V_{th}$ cells at the two boundary temperatures, 25 °C and 125 °C, and at two intermediate temperatures (75 °C and 105 °C).
B. Experimental Results on ISCAS Benchmarks
In this section, we present the results obtained by applying our methodology to the ISCAS'85 benchmark suite [35]. All experimental data are collected in Table II, where, for each benchmark, we show the results corresponding to the proposed thermal-aware dual-$V_{th}$ assignment strategy, and compare them against the approach based on over-design. We do not provide a comparison against other dual-$V_{th}$ algorithms in the literature because they do not consider temperature as a variable.
The first two rows of each benchmark entry report the values for a fixed-temperature dual-$V_{th}$ methodology executed at 125 °C and 25 °C, respectively. Another row reports the over-constrained case, namely, a synthesis done at 125 °C using a tightened timing constraint, whose value is shown in a dedicated column. Finally, the last row shows the results obtained with the proposed synthesis methodology presented in Section IV.
For each circuit, the table reports the total number of cells and the total area, followed by the area occupied by LVT cells and by HVT cells, respectively. A further column reports the timing constraint imposed during synthesis. This constraint is set relative to the delay obtained when all gates are low-$V_{th}$ and are assigned to their minimum size.
Four columns report the post-synthesis critical delay at the four temperatures that have been characterized (25, 75, 105, and 125 °C). The last two columns report leakage power; since leakage is a monotonic function of temperature, we report values only at the two extreme temperatures.
We start commenting on the results by focusing on benchmark c5315. The first observation concerns the inefficiency of a fixed-temperature synthesis. When the synthesis is performed at high temperature, the timing margin is not respected at low temperatures because of ITD, causing a timing fault. The delay at 25 °C is 2.79 ns, while the constraint is 2.70 ns (i.e., a violation of about 3%).
The same problem appears if we synthesize the circuit at room temperature. The only difference is that now the violation is at high temperature, where the low-threshold cells are slower, and the interconnects show larger parasitic effects.
An obvious solution to the timing-fault problem is to over-design the circuit; by performing a 125 °C synthesis under a timing constraint that is smaller than the nominal one, we can increase the available slack. In this way, we guarantee that, even if the paths get slower at room temperature, they remain below the nominal threshold. In the case of c5315, the circuit was synthesized with a constraint of 2.59 ns, so that the delay is smaller than 2.7 ns at all temperatures. The main shortcoming of this approach is that an over-designed circuit consumes more power. For instance, at 125 °C, the over-constrained design shows a leakage power consumption that is 12% larger than that of the non-over-constrained design. Using our approach, we are able to meet the timing constraint at all temperatures while reducing power; compared to the over-constrained solution, the leakage at 25 °C decreases from 166 nW to 77.4 nW, and from 6.44 to 3.20 at 125 °C, thereby obtaining a leakage reduction of more than 50% in both cases.
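As a quick check of the c5315 figures quoted above, the relative leakage reductions and the amount of over-constraining follow directly from the reported values (the unit of the 125 °C leakage values cancels in the ratio):

```latex
\[
1 - \frac{77.4}{166} \approx 0.534, \qquad
1 - \frac{3.20}{6.44} \approx 0.503, \qquad
\frac{2.70 - 2.59}{2.70} \approx 0.041
\]
```

That is, roughly 53% and 50% leakage reduction at 25 °C and 125 °C, respectively, against an over-constrained flow whose timing constraint was tightened by about 4%.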
Similar considerations apply to the other benchmarks, although care should be taken when observing the peculiar and complex relationship among number of gates, area, and power. We note that an increase in the total area does not necessarily correspond to an increase in the number of cells. For instance, comparing two of the reported implementations, the number of cells decreases from 1055 to 1051, but the total area increases from 3412 to 3453. This happens because, in order to match the timing constraint, the synthesis tool uses a smaller number of larger cells. Over-sizing the gates reduces the propagation delay, but the power consumption increases.
Likewise, it is possible that the circuit synthesized with our flow has less leakage but larger area than the over-constrained design, as happens for one of the benchmarks. Again, the reason lies in the topological mapping process. Our algorithm selects candidate cells (for the HVT to LVT replacement) in increasing order of leakage. However, since cells with minimum leakage are not necessarily the smallest, it could be that the LVT area actually increases, while the total leakage power goes down.
One final remark concerns dynamic power. Since our methodology is based on replacing HVT cells with LVT ones (and, in particular, does not modify cell sizes), the only possible source of change in dynamic power is the different switching power of an HVT and an LVT instance of a given cell. In the library used in the experiments, this difference has been measured to be negligible, so that the dynamic power of the original and multi-$V_{th}$ implementations is virtually identical; for this reason, we have not reported any figure relative to dynamic power in this section.
C. Impact of Timing Constraints
Fig. 8 shows the timing analysis of two synthesis flows for the ISCAS benchmark c5315; more specifically, we plot the worst-case propagation delay at both 25 °C (dashed bars) and 125 °C (solid bars), in the case that the circuit is synthesized using a timing library characterized at 25 °C (i.e., S25) and 125 °C (i.e., S125), respectively. The delays are normalized to the actual timing constraint. The graph plots the resulting delay for different values of the delay constraint.
Let us first consider the tightest delay constraint, where all the critical paths consist of LVT cells. If we consider the S25 flow, we obtain a circuit that is timing compliant when operating at 25 °C by construction, but, at 125 °C, since the LVT cells show a direct temperature dependence, the propagation delay is about 3% larger than the timing constraint. In contrast, for the S125 flow, we obtain timing compliance at 125 °C, but at 25 °C a timing fault occurs. The reason is that a path that is not critical at 125 °C is typically mapped with a large number of HVT gates, and it becomes the actual critical path at room temperature (6% timing violation). The same situation also appears over a range of tight delay constraints. In this range, our methodology can guarantee timing closure with no risk of a power penalty.
As the timing constraint becomes more relaxed, HVT cells are also used on critical paths. Therefore, the actual worst case now occurs at room temperature, and the S25 synthesis can be conservative enough to meet the timing requirements at any temperature. In fact, at both temperatures, the circuit shows a propagation delay smaller than the constraint. Conversely, synthesizing at 125 °C incurs a timing fault when operating at room temperature due to ITD.
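The reasoning in this subsection can be illustrated with a small multi-corner check. The sketch below is a toy model: it assumes that path delay varies linearly between the 25 °C and 125 °C corners (an assumption made only for this example, not a claim from the characterization), and the two paths and their delays in picoseconds are hypothetical values loosely modeled on the c5315 discussion. The names `delay_at` and `violations` are likewise hypothetical.

```python
def delay_at(d25, d125, temp):
    """Path delay at `temp` (in C), linearly interpolated between the two corners."""
    return d25 + (d125 - d25) * (temp - 25) / (125 - 25)


def violations(paths, constraint, temps=(25, 75, 105, 125)):
    """Map each path name to the temperatures at which it misses the constraint."""
    return {
        name: [t for t in temps if delay_at(d25, d125, t) > constraint]
        for name, (d25, d125) in paths.items()
    }


# Delays in ps: the LVT-dominated path slows down with temperature,
# the HVT-dominated path speeds up (inverse temperature dependence).
paths = {
    "lvt_path": (2600, 2780),
    "hvt_path": (2790, 2550),
}
report = violations(paths, constraint=2700)
```

Checking only one corner hides one of the two violations: a 25 °C-only analysis misses the LVT-dominated path, while a 125 °C-only analysis misses the HVT-dominated one.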
D. Case Study
In order to demonstrate the applicability of the proposed methodology to circuits of realistic sizes, we tested our flow on an arithmetic floating-point unit (FPU) available on the OpenCores portal. The core complies with the IEEE 754 standard for floating-point computation [36] and supports five arithmetic operations: addition, subtraction, multiplication, division, and square root. For each operation, four different rounding modes are supported (Round-To-Nearest-Even, Round-To-Zero, Round-Up, Round-Down); the unit is designed to detect all five IEEE standard exception types (Overflow, Underflow, Invalid Operation, Inexact, Divide-by-Zero), plus Infinity and Zero.
Each arithmetic operation is performed by a dedicated unit, which allows effective application of low-power techniques such as clock-gating and power-gating. The output of each single unit is connected to the primary output through a multiplexed bus that is controlled by an external control signal that specifies the type of operation.
As in the previous section, we compare the outputs of four different synthesis flows: classical dual-$V_{th}$ synthesis at 25 °C and at 125 °C, the over-constrained case, and our methodology. The target timing constraint of the FPU core is 9 ns, which is the best performance that we can achieve with our low-power library. Fig. 9 plots the length of the critical paths for each synthesis flow, at 25, 75, 105, and 125 °C.
As for the ISCAS benchmarks, with the 125 °C synthesis, the timing constraint is met at 125 °C by construction, but when the core operates at room temperature, a timing violation occurs. A similar problem occurs if we synthesize the circuit at room temperature. Now the timing violation happens at high temperature, where the low-threshold cells are slower. The over-constrained synthesis at 125 °C solves the problem by keeping the critical path below the 9 ns constraint at all temperatures. The same happens when using our temperature-aware flow (see Fig. 9), yet with a much smaller power dissipation, as shown in Fig. 10. Table III reports various statistics of the synthesized FPU core for each synthesis flow, including the total number of cells and the number of LVT and HVT cells.
As discussed above, using a tighter timing constraint causes the circuit area to increase, as shown for the over-constrained synthesis and for our synthesis; however, notice that both approaches consume exactly the same area (92806.5), since they start with the same netlist and only the threshold voltage of selected cells is modified. The power efficiency of our approach is therefore due to a different distribution of HVT and LVT cells (i.e., it uses fewer LVT cells than the over-constrained design).
VI. CONCLUSION
The inversion of temperature dependence (ITD) in sub-90 nm CMOS devices raises new challenges for today's synthesis tools. In this article, we have explored the impact that ITD may have on circuits synthesized via traditional tools and flows. We have identified potential shortcomings of single-temperature, dual-$V_{th}$ synthesis tools; in particular, we have demonstrated the inadequacy of such an approach with respect to the adherence of the synthesized circuits to tight timing constraints under varying thermal conditions.
We also validated a new low-power dual-$V_{th}$ synthesis methodology that takes into account the inverse temperature dependence (ITD) phenomenon. Using our methodology, we are able to produce a circuit netlist that is timing-compliant along the entire operating range and with a reduced leakage compared to circuits generated using a commercial synthesis tool. We show over 28% leakage power improvement on both ISCAS benchmark circuits and on a realistic design.
