Temperature is a dominant factor in the performance, reliability, and leakage power consumption of modern processors. As a result, increasing numbers of researchers evaluate thermal characteristics in their proposals. In this paper, we measure a real processor and characterize its thermal behavior while it executes diverse workloads.
Introduction
This paper presents an experimental approach to measure and characterize the thermal behavior of real processors and their workloads. The aim of our work is to provide experimental data and insights to further validate and address the issues and challenges associated with the simulation and modeling of the thermal characteristics of processor designs.
Temperature has become a first-order processor design constraint and has proven to be a key limiter in performance, throttling, clock skew, leakage power, reliability, variability, and cooling costs for modern processors. This has resulted in an increase in thermal publications. Figure 1 shows the number of papers published recently, in top computer architecture and VLSI conferences, addressing thermal issues. The top line in the chart tracks the fraction of those papers using thermal simulation, while the bottom trend line shows the fraction of those simulation-based papers which use HotSpot [26].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASPLOS '10
Most thermal research approaches require methods to calculate and measure temperature. This is traditionally addressed in three ways: using thermal and power simulation infrastructures such as HotSpot [26] and Wattch [2], using hybrid approaches that mix thermal simulation and direct measurements of architectural state via performance counters, or using direct measurements from on-chip thermal sensors such as thermal diodes.
Approaches using power and temperature simulation to estimate temperature offer the freedom to analyze different alternatives. However, the architectural simulations required to generate power consumption traces are slow, forcing architects to simplify their models. For example, most prior approaches either perform a single "long" simulation or use profile-based sampling simulations. For most modern processors, 1 billion instructions account for less than 1 second of execution. These constraints raise questions about the required simulation length and the impact of profile-based sampling. These questions are compounded by the effect that design parameters such as heatsink configuration and power density have on the thermal time response.
In addition to simulation length and thermal time response (temporal resolution), spatial resolution is another challenging question regarding thermal simulation and measurement. This is mainly due to variable power densities, not just across major processor structures but among some of the substructures that comprise each processor block [1] . Diverse power densities inside major processor structures, together with their cycling and utilization behavior, may define different thermal characteristics not only across major structures but inside them as well.
This paper attempts to provide educated answers to these questions using experimental data, which is used to provide a thorough thermal characterization of most of the applications in the SPEC CPU 2000 and 2006 benchmark suites. The experimental results show that the majority of SPEC applications have very predictable thermal behavior.
The main contributions of this paper are: (1) a characterization of the SPEC benchmark suites for multiple thermal-related metrics; (2) evidence that conventional thermal simulation methodologies based on short execution times using profile-based sampling methods like Simpoint [24] can be ill-suited for thermal characterization; (3) preliminary results showing that the currently used spatial resolution at block granularity has issues with some thermal metrics; (4) an analysis of the thermal time response of real processor systems and a discussion of its impact on the required simulation time; (5) evidence that IO-intensive applications have more thermal variability than SPEC; and (6) a set of recommendations for future research involving processor temperature.
The rest of the paper is organized as follows: Section 2 defines several temperature-dependent metrics, their related performance metrics, and the thermal implications of profile-based sampling. Section 3 addresses some issues related to the thermal measurement setup; Section 4 quickly summarizes the experimental setup and the parameters selected; Section 5 provides experimental results; Section 6 explains the related work; and finally Section 7 shows the conclusions and future work.
Temperature-Aware Metrics
In order to analyze applications from a thermal standpoint, a well-defined set of temperature-aware metrics is required. Section 2.1 contains a summary of key thermal metrics. Section 2.2 details the related performance metrics. Lastly, Section 2.3 covers profile-based sampling and its potential impact on thermal evaluations.
Thermal Metrics
Temperature has a different impact on several key design factors such as timing integrity, reliability, leakage power, and cooling cost. For example, temperature has an exponential impact on leakage power but a linear relationship with various timing violations.
Furthermore, since power consumption is usually non-uniform across the die, this implies that temperature does not necessarily have a constant or uniform distribution across area or time. Thus, to properly characterize the thermal behavior of a processor, the pertinent metrics must capture both temporal and spatial temperature behavior of the design.
Thermal issues can be analyzed differently regarding their impact on timing, reliability, and leakage energy consumption. To address this, we define 3 different categories of thermal metrics: timing, reliability, and energy.
Timing
Thermal metrics regarding timing considerations include maximum temperature (MaxT ) and maximum spatial difference in temperature which we refer to as thermal gradient (GradT ). MaxT affects performance throttling. Furthermore, since temperature affects the operating frequency of circuits, it may limit the speed for the entire design. For the same reason, GradT affects performance because a high thermal gradient in the chip can induce skew violations and timing violations.
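As a minimal sketch of the two timing metrics, the snippet below computes MaxT and GradT for a single thermal frame. The frame layout (a list of rows of temperatures in °C) and the function name `max_and_gradient` are assumptions made for illustration; the measured data in this paper comes from IR images, whose format is not described here.

```python
def max_and_gradient(frame):
    """Return (MaxT, GradT) for one thermal frame.

    frame: list of rows, each row a list of temperatures in deg C
    (hypothetical layout). GradT is taken as the largest spatial
    difference in the frame, i.e. hottest minus coldest point.
    """
    temps = [t for row in frame for t in row]
    max_t = max(temps)
    grad_t = max_t - min(temps)
    return max_t, grad_t


# Illustrative 3x3 frame with a hot spot in the center (made-up values).
frame = [
    [55.0, 61.5, 58.0],
    [63.0, 78.0, 66.5],
    [57.5, 64.0, 59.0],
]
max_t, grad_t = max_and_gradient(frame)
print(f"MaxT = {max_t} C, GradT = {grad_t} C")
```

For a full trace, the same function could be applied to every frame and the per-frame maxima reduced with `max`.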
Reliability
We use RAMP [27] as a quantitative basis for reliability. This work describes 5 Mean Time To Failure (MTTF) wear-out failure models: Electromigration (EM), Stress Migration (SM), Time-Dependent Dielectric Breakdown (TDDB), Thermal Cycling (TC), and Negative Bias Temperature Instability (NBTI).
Since MTTF is not additive, the average Failures in Time (FIT) per block is estimated as the application executes. By default, a block corresponds to a processor architectural unit, although our work also considers finer-grain spatial resolution. In both cases, the FIT is proportional to the area. At the end of the execution, we add the area-weighted FITs to report the overall FIT value for the entire processor. Like the RAMP authors, we assume that all the different failure mechanisms have the same contribution to the overall FIT value, which is adjusted to a preset value. In our case, we adjust the FIT value for all the SPEC applications to 10,000. This is approximately equivalent to an MTTF of 11 years, which is a short but reasonable lifetime for a 65nm chip.
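The FIT-to-lifetime arithmetic can be checked with a short sketch. It assumes the usual definition of FIT as failures per 10^9 device-hours and 8,760 hours per year. The function names, and the choice in `total_fit` to weight each block by its fraction of the total area, are illustrative assumptions rather than the exact procedure used for the measurements.

```python
HOURS_PER_YEAR = 24 * 365          # 8,760 hours
FIT_DEVICE_HOURS = 1e9             # FIT = failures per 10^9 device-hours


def fit_to_mttf_years(fit):
    """Convert a FIT value into an MTTF expressed in years."""
    return FIT_DEVICE_HOURS / fit / HOURS_PER_YEAR


def total_fit(block_fits, block_areas):
    """Area-weighted sum of per-block FIT values.

    Assumption: each per-block FIT is weighted by the block's share
    of the total area before the values are added.
    """
    total_area = sum(block_areas)
    return sum(f * a / total_area for f, a in zip(block_fits, block_areas))


# Overall target of 10,000 FIT, split evenly across 5 failure mechanisms.
overall_years = fit_to_mttf_years(10_000)
per_mechanism_years = fit_to_mttf_years(10_000 / 5)
print(f"overall MTTF: {overall_years:.2f} years")
print(f"per-mechanism MTTF: {per_mechanism_years:.2f} years")
```

Under these assumptions, 10,000 FIT corresponds to roughly 11.4 years, and each of the five equally weighted mechanisms (2,000 FIT) corresponds to about 57.08 years, which is the value the reliability metrics in Table 3 are normalized to.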
Electromigration: EM occurs when atoms migrate from one end of the interconnect to the other, eventually leading to increased resistance and shorts. This work uses the EM model defined in RAMP [27].
Stress migration: Materials differ in their thermal expansion rate, and this difference causes thermomechanical stress, referred to as Stress Migration. We use the SM model defined in RAMP [27].
Time-dependent dielectric breakdown: TDDB is the result of the gradual wear-out of the gate dielectric, which leads to transistor failure. We use the TDDB model from RAMP [27].
Thermal cycling: Thermal cycling is another reliability factor, since temporal thermal gradients (e.g., powering on and off, or high-frequency changes in power due to changes in workload behavior) affect the lifetime of the processor. There is no validated model for high-frequency thermal cycles, but the effects of low-frequency cycling can be modeled with the TC model from RAMP [27].
Negative bias temperature instability: NBTI leads to upward shifts in the transistors' threshold voltage, which in turn lead to timing violations. We use the NBTI model from RAMP [27].
Energy
Energy thermal considerations revolve mostly around leakage power (Leak) because of its temperature dependency. The BSIM3 model [14] provides a way to estimate leakage power.
Different processor blocks may have different transistor densities and/or leakage. Since this information is not available for real processors, we assume that leakage is proportional to area. As with the reliability metrics, we consider both block-level and finer-grain spatial resolutions.
Thermal Time Constant
In order to capture the thermal "speed of change", we define a thermal time constant (τ) as the time it takes the chip, once the application stops executing, to cool down until its temperature difference with respect to $T_0$ falls to $(T_1 - T_0) \cdot e^{-1}$. Similarly, τ can also be defined as the time it takes the chip to warm up from $T_0$ by $(T_1 - T_0) \cdot (1 - e^{-1})$ when the application starts executing. The behavior is similar to the time constant of an RC circuit. Our experimental data shows thermal time constants for modern processors that are on the order of milliseconds. This gives a clear insight into how long a simulation must run to properly characterize the thermal behavior of a design. For example, based on our experimental figures, it is necessary to wait between 5 and 300 ms to observe a thermal swing of around 63%.
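The definition of τ can be illustrated with a small sketch that extracts the time constant from a cooling trace. The trace format (parallel lists of time stamps and temperatures), the linear interpolation between samples, and the name `cooling_time_constant` are assumptions for illustration; the synthetic trace is an ideal single-pole RC response, not measured data.

```python
import math


def cooling_time_constant(times_ms, temps_c, t0_c):
    """Estimate tau from a cooling trace.

    times_ms, temps_c: samples taken after the application stops,
    starting at the hot temperature T1 = temps_c[0].
    t0_c: idle temperature T0 that the chip cools down towards.
    Returns the (interpolated) time at which T - T0 has fallen to
    (T1 - T0) / e, or None if the trace never gets there.
    """
    t1_c = temps_c[0]
    target = t0_c + (t1_c - t0_c) * math.exp(-1)
    for i in range(1, len(temps_c)):
        if temps_c[i] <= target:
            prev_t, cur_t = temps_c[i - 1], temps_c[i]
            frac = (prev_t - target) / (prev_t - cur_t)
            return times_ms[i - 1] + frac * (times_ms[i] - times_ms[i - 1])
    return None


# Synthetic RC-style cooling from 75 C to 45 C with tau = 54 ms.
times = list(range(0, 300))
temps = [45.0 + 30.0 * math.exp(-t / 54.0) for t in times]
tau = cooling_time_constant(times, temps, 45.0)
print(f"estimated tau: {tau:.1f} ms")
```

The same routine applies to the warm-up definition if the trace is mirrored, since both definitions correspond to a swing of about 63% of $T_1 - T_0$.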
Related Performance Metrics
Performance metrics can be used as a proxy for thermal behavior. We use a set of performance metrics as an approximation to the associated temperature profile of the application. These metrics include average IPC and maximum IPC. For the latter, the thermal time constant is used to filter the performance spikes that do not have a considerable effect on temperature. The average IPC is, however, calculated over the entire execution time.
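One possible reading of this filtering step is a moving average whose window equals the thermal time constant, so that IPC spikes much shorter than τ are smoothed out before the maximum is taken. The sketch below follows that reading; the moving-average filter, the function names, and the sample values are assumptions, since the exact filter is not specified here.

```python
def average_ipc(samples):
    """Average IPC over the entire execution."""
    return sum(samples) / len(samples)


def filtered_max_ipc(samples, sample_period_ms, tau_ms):
    """Maximum IPC after averaging over windows of length tau.

    Assumption: a simple moving average stands in for the
    thermal-time-constant filter described in the text.
    """
    window = max(1, min(len(samples), round(tau_ms / sample_period_ms)))
    return max(
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    )


# Hypothetical IPC trace sampled every 10 ms: one short spike to 2.0
# and a sustained phase at 1.2.
samples = [0.5] * 10 + [2.0] + [0.5] * 10 + [1.2] * 10
raw_max = max(samples)
max_ipc = filtered_max_ipc(samples, sample_period_ms=10, tau_ms=50)
print(f"raw max = {raw_max}, filtered MaxIPC = {max_ipc:.2f}, "
      f"average IPC = {average_ipc(samples):.2f}")
```

In this made-up trace the single 10 ms spike no longer determines MaxIPC; the sustained phase does.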
Profile-based Sampling Impact
Profile-based sampling techniques that leverage Basic Block Vectors (BBV) like Simpoint [24] are commonly used to summarize application execution. Their goal is to find a subset of the application that matches the behavior for the entire application, so that a smaller window of execution can be used for simulation instead of executing the program to completion.
Many thermal simulation methods use Simpoint to accelerate execution, with the assumption that thermal phases match performance phases. This assumption, however, induces significant error.
To determine this error, we gather a thermal trace for the entire processor, measured by an IR camera during program execution. We also gather performance metrics, in real time, from the program executing on the same processor being measured. Simpoint [24] is then used to determine the simulation points and their weights. Each simulation point needs to be adjusted by its own weight; we define this as True Simpoint (True). This is very rare in thermal simulations, since most previous works simply concatenate simulation points without weights (Typical). Some authors go one step further and use only one simulation point. This can be the most representative point but, more commonly, only the first simulation point (First) is selected to further reduce simulation time. Although a few papers do not perform warm-up, it is well understood that thermal warm-up is required. Unless otherwise stated, we perform warm-up before the first simulation point.
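The difference between the True, Typical, and First methods can be sketched for a single per-point metric. The function name, the dictionary keys, and the numbers below are hypothetical; the sketch only shows how the three aggregation rules differ, not how the measured traces are processed.

```python
def summarize_simulation_points(values, weights):
    """Aggregate one metric over simulation points in three ways.

    values:  metric measured in each simulation point (hypothetical)
    weights: Simpoint weight of each point
    """
    true_value = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    typical_value = sum(values) / len(values)   # points concatenated, unweighted
    first_value = values[0]                     # only the first point
    return {"True": true_value, "Typical": typical_value, "First": first_value}


# Made-up average temperatures (deg C) for three simulation points.
summary = summarize_simulation_points([60.0, 70.0, 80.0], [0.6, 0.3, 0.1])
print(summary)
```

When the weights are far from uniform, as in this made-up case, the three methods report noticeably different values for the same set of simulation points.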
Thermal Measurement Setup
We leverage previous infrastructures [6, 15], developed to characterize chips thermally by directly measuring the temperature at the transistor layer using an infrared (IR) imaging system. Traditional chip cooling solutions use metal heatsinks and heat spreaders, which are opaque to the infrared spectrum. An IR-transparent cooling solution based on a continuous flow of IR-transparent oil is used to cool the target chip, while allowing us to observe its operation under nominal frequency and voltage specifications. A key difference from the setup in [15] is the addition of a 3mm thick sapphire window on top of the die to compensate for the replacement of the metal heatsink and heat spreader combo. The objective of this section is to analytically and empirically show that an oil cooling solution can be used to characterize thermal phases. In the following subsections we address questions regarding oil cooling efficiency (thermal resistance), the oil flow direction, and the oil thermal transient response (thermal capacitance).
Oil Cooling Efficiency
Different cooling solutions can have different specific heat, thermal capacitance, and thermal resistance. For steady-state analysis, only the thermal resistance affects the cooling efficiency of the system. If we apply a constant power, the thermal capacitance does not have any effect. Heat transfer equations are similar to the Maxwell electrical equations. The electrical capacitors do not have any effect when a constant voltage is applied. Similarly, the thermal capacitance does not affect the temperature for a constant power.
Metal has a different vertical and lateral resistance than oil. In addition, there is a Thermal Interface Material (TIM) between the silicon and the heatsink. Non-uniform thermal resistance raises possible issues with the IR measurement setup [9] . We adapt the solution presented in [12] . First, apply an infrared transparent sapphire window (SW) on top of the die to compensate for the TIM and resistance changes. Second, adjust the oil flow to match the cooling performance of the equivalent metal heatsink solution.
A sapphire window increases the thermal capacitance and improves lateral heat spreading. The other missing factor is the TIM. Typical TIMs have thermal conductivity between 1 and 4 W/(m·K). This thermal resistance is placed between the silicon substrate and the metal heatsink. For the oil solution, we use oil as an IR-transparent TIM. Liquids are effective TIMs but are not commercially used.
Once we have a sapphire window, we control the oil volume flow as previously done by [15]. Nevertheless, there are physical limits or lower bounds beyond which the oil flow stops behaving like a laminar flow. To safely avoid difficult-to-model oil flow artifacts, we fix the oil flow speed to 16 m/s, and restrict the minimum oil thickness to 1mm. For systems operating under 20W, additional adjustments lowering flow thickness and rate may be required.
To evaluate the effectiveness of the previous corrections, we use a testchip with a 484 mm² die area implemented on a BGA GL771 package as shown in Figure 2. The chip is divided into a grid of blocks arranged as non-overlapping tiles of similar area. The power consumption for each block in the chip can be independently controlled. Each block also has a unique embedded thermal diode, which can sense sub-millisecond thermal responses with sub-centigrade error in temperature. We focus our attention on 3 specific blocks in the testchip: a single block in the center (P) which is powered arbitrarily, and two adjacent blocks (S1, S2) which remain inert. Figure 3 shows the temperature for blocks P, S1, and S2 in the testchip when different steady-state power consumptions are applied to block P. The oil flows from top to bottom. When Block P gets powered from 0 to 250 W/cm², we measure a consistent linear increase in temperature for the heatsink (HS) and the sapphire window (SW). This implies that the overall vertical thermal resistances of the oil and metal heatsink solutions are very similar.
Figure 3. Temperatures for blocks P, S1, and S2 with different constant power in block P. HS and SW stand for AMD Mobile heatsink and sapphire window respectively.
To validate the lateral thermal resistance, we measure blocks S1 and S2. S1 is adjacent to block P, and therefore its temperature is very close to P when using both the metal and the sapphire solution. However, the temperature difference between block P and block S2, placed 5mm away from each other, increases as the power density increases. This is because the lateral thermal resistance creates thermal gradients on the die. We observe that both the traditional heatsink and sapphire have a consistent slope.
This simple experiment also validates the behavior of the sapphire window as an alternative cooling solution to a metal heatsink with similar vertical and lateral thermal resistances.
Oil Flow Direction
To compensate for the different cooling efficiency across the die due to the direction of the oil flow, the IR setup performs an additional image correction over each captured frame. In the worst case, we observe a maximum temperature gradient of 4 °C between opposite sides of the testchip. This corresponds to approximately 0.2 °C of correction for each mm that the oil flows over a hot block.
Our experimental evaluation showed smaller gradients than those reported by [9]. The reason is that the sapphire window attenuates the impact of the oil flow because it spreads the heat from a more localized area at the silicon substrate to a bigger area at the top of the sapphire window. To further reduce the gradients due to oil flow, it is possible to add a diamond heatsink on top of the sapphire window, which improves lateral heat spreading. Alternatively, we explore a software correction mechanism.
If all the blocks are uniformly heated, applying the 0.2 °C/mm correction is a simple and effective alternative. However, real chips do not display such uniform temperature across their dies. Ideally, a model describing the fluid dynamics of the oil should be used to perform the oil flow correction. However, this solution is too compute-intensive, especially considering the 4 °C worst case. Instead, we use a quick approximation to estimate the oil flow correction. For every mm that the oil flows over a block, we adjust the correction by $0.2 \times \frac{\mathrm{BlockTemp} - 45\,^{\circ}\mathrm{C}}{10\,^{\circ}\mathrm{C}}$. We never let the correction be negative. This is a simple algorithm with linear cost that provides a fast and effective solution. To evaluate the accuracy of the correction and the impact of the oil flow, we repeat the same experiment as in Figure 3. This time we apply the oil from four possible directions when block P is powered with 7.8W. The values in parentheses are the uncorrected values obtained when the oil flow correction algorithm is not applied. The only blocks affected by the oil flow direction are S1 and S2 when we have a horizontal flow. Without correction, the maximum error is 1.8 °C (S1 with Left-Right flow). After the correction, the error is reduced to 0.9 °C.
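The linear-cost correction can be sketched along a single line of blocks in the direction of the oil flow. Several details below are assumptions rather than the exact algorithm: the correction accumulated over upstream blocks is subtracted from each downstream measurement, the measured block temperature is used as BlockTemp, and the accumulated value is clamped at zero. The function name and the block values are hypothetical.

```python
def correct_oil_flow(blocks):
    """Apply the approximate oil flow correction along one flow line.

    blocks: list of (length_mm, measured_temp_c) tuples, ordered as the
    oil reaches them. Returns the corrected temperature of each block.
    """
    corrected = []
    accumulated = 0.0  # correction built up over upstream blocks, in deg C
    for length_mm, measured_c in blocks:
        # Oil arriving here was already heated upstream, so the block
        # reads hotter than it would with fresh oil (assumption).
        corrected.append(measured_c - accumulated)
        # Per-mm adjustment from the text: 0.2 * (BlockTemp - 45) / 10.
        per_mm = 0.2 * (measured_c - 45.0) / 10.0
        # The correction is never allowed to become negative.
        accumulated = max(0.0, accumulated + per_mm * length_mm)
    return corrected


# Three 5 mm blocks, with the hottest block upstream (made-up values).
corrected = correct_oil_flow([(5.0, 65.0), (5.0, 50.0), (5.0, 40.0)])
print(corrected)
```

Each block is visited once per flow line, which keeps the cost linear in the number of blocks, in line with the description above.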
Oil Thermal Transient Response
To validate the thermal response for the oil flow, we start by performing thermal simulations using an Athlon-like chip configuration with metal heatsink and oil. Then, we continue by performing experiments with the testchip. Figure 4 shows the simulated thermal transient response for four different cooling solutions using an Athlon-like chip with package. "HS" is a traditional heatsink solution; "Half HS" is equivalent to "HS" but halving the metal spreader thickness; "Oil" shows the theoretical thermal response with an oil cooling solution; "SW" shows the thermal response when a sapphire window is placed between the silicon substrate and the oil flow.
Reducing the heatsink thickness ("HS" vs "Half HS") affects the metal spreader thermal capacitance and lateral thermal resistance. As a result, a slightly different thermal response is observed but both cooling solutions still show the same thermal phases. While the additional thermal capacitance attenuates the thermal response, the decreased lateral resistance reduces the cooling efficiency.
As reported by [9], "Oil" flow has a lower thermal capacitance and a faster response, showing clearer thermal phases. The previous section has shown that a sapphire window (SW) reduces the thermal resistance difference between oil and the heatsink. Sapphire also increases the thermal capacitance because its specific heat (0.75 J/(g·K)) is double that of copper, while its density is approximately half. As a result, copper and sapphire have an equivalent thermal capacitance. In conclusion, "SW" has a more attenuated thermal response than "Oil".
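The capacitance argument can be checked with a rough calculation of the heat capacity per unit volume, which is the product of specific heat and density. Only the 0.75 J/(g·K) figure for sapphire comes from the text; the copper specific heat and both densities below are typical textbook values and should be read as assumptions.

```python
# Specific heat in J/(g*K) and density in g/cm^3.
SAPPHIRE_SPECIFIC_HEAT = 0.75   # from the text
SAPPHIRE_DENSITY = 3.98         # assumed textbook value
COPPER_SPECIFIC_HEAT = 0.385    # assumed textbook value
COPPER_DENSITY = 8.96           # assumed textbook value


def volumetric_heat_capacity(specific_heat, density):
    """Heat capacity per unit volume in J/(cm^3*K)."""
    return specific_heat * density


sapphire = volumetric_heat_capacity(SAPPHIRE_SPECIFIC_HEAT, SAPPHIRE_DENSITY)
copper = volumetric_heat_capacity(COPPER_SPECIFIC_HEAT, COPPER_DENSITY)
ratio = sapphire / copper
print(f"sapphire: {sapphire:.2f}, copper: {copper:.2f}, ratio: {ratio:.2f}")
```

With these assumed values, the two materials store a similar amount of heat per unit volume, which is consistent with treating their thermal capacitances as roughly equivalent for parts of similar size.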
In our experimental validation, we use a testchip which provides μs sampling capabilities. Figure 5 shows block P temperature when a periodic power pulse is applied to it and the rest of the chip is idle. The same power pulse is applied for a heat sink "HS" and a sapphire window "SW".
The 25ms power pulse is applied every 100ms (10Hz). We clearly observe that the thermal transients of the heatsink and the sapphire are very close. [9] pointed out that an oil cooling solution with this power pulse would have a significant error for fast transients. The measured results show that a sapphire window solves the problem. We also perform 250 ms power pulses to validate the equivalence between the heatsink and the sapphire window for slower transients. Again, the proposed cooling solution closely matches the metal heatsink solution.
Combining the fast and slow transient response accuracy with the cooling efficiency validation for the vertical and lateral thermal resistances, we conclude that the oil cooling solution with a sapphire window is an appropriate vehicle to capture existing thermal phases.
Evaluation Setup Parameters
Performance Parameters
To gather IPC traces we use pfmon [11], a performance monitoring tool. One of the machine-specific registers (MSR) available in the CPU is used to count the number of retired instructions (RI). We program a threshold that, when reached, resets the RI counter to zero, resumes counting, and generates an interrupt. A time stamp is saved in the interrupt service routine for this event so that we can determine the time it takes to retire the threshold number of instructions. To extract the basic block vectors (BBV), we use Valgrind 3.3.0 [19] and the exp-bbv plugin [24]. Then, we use Simpoint 3.2 [24] to generate the simulation points.
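The conversion from interrupt time stamps to an IPC trace can be sketched as follows. Each interval between two consecutive time stamps retires exactly the threshold number of instructions, so its IPC is the threshold divided by the elapsed cycles. The function name, the threshold value, and the time stamps are hypothetical; the 1.7 GHz clock matches the processor measured in this work, and a fixed clock frequency is assumed.

```python
def ipc_trace(timestamps_s, threshold_instructions, clock_hz):
    """Compute one IPC value per interval between interrupts.

    timestamps_s: time stamps (seconds) saved by the interrupt handler
    threshold_instructions: retired instructions between interrupts
    clock_hz: processor clock frequency, assumed constant
    """
    trace = []
    for start, end in zip(timestamps_s, timestamps_s[1:]):
        cycles = (end - start) * clock_hz
        trace.append(threshold_instructions / cycles)
    return trace


# Hypothetical threshold of 170M instructions on a 1.7 GHz core.
trace = ipc_trace([0.0, 0.1, 0.3], 170_000_000, 1.7e9)
print([round(ipc, 2) for ipc in trace])
```

A smaller threshold gives finer time resolution at the cost of more interrupts; the threshold used for the measurements is not stated here.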
Thermal Parameters
The thermal measurement setup uses an infrared camera to directly measure the transistor temperature. The infrared camera is a FLIR SC-8000. This camera provides real-time thermal imaging at a resolution of 1024x1024 pixels, with sampling rates of over 100Hz.
For each category of thermal metrics (performance, reliability, energy), we report the metrics based on constants extracted from 65nm technology files. Table 2 summarizes all the metrics and their parameters.
Table 2. Thermal metrics constants. (Columns: Category, Metric, Parameters.)
As stated previously, this work focuses on the measurement of thermal characteristics from a real processor operating under nominal frequency and voltage. For this purpose, we measure an AMD K8-based processor, running at 1.7GHz. It offers an out-of-order single core, manufactured using a 130nm process, with a peak power consumption of 70 W.
Cooling solution
As explained in Section 3, we use an IR transparent heatsink with a sapphire window that matches the cooling characteristics of the metal cooling solution for a Mobile Athlon processor.
In order to satisfy those requirements, we use an oil-sapphire combination as the cooling solution for this experiment. A 3mm thick sapphire window with a 50mm diameter is placed on top of the chip, which acts like a traditional heat spreader while allowing IR energy to be measured. This window is then cooled by a flow of mineral oil, which is also transparent to IR and removes the heat. The mineral oil is temperature-controlled by a heat exchanger that maintains the oil between 15 °C and 20 °C. This allows the jet of oil to remove heat from the sapphire window optimally, while maintaining a laminar flow regime with a speed of around 16 m/s. We use purified mineral oil with a specific heat of 1.63 J/(g·K).
Applications
We evaluate almost all of the applications of the SPEC00 and SPEC06 suites (24 from SPEC00 and 22 from SPEC06). Reference input sets are used for all the SPEC benchmarks.
Applications with a mixture of computation and IO tend to display more varied thermal behavior as observed with our IR setup. Since all the SPEC applications are designed to be CPU bound, we complement them by also evaluating 5 applications involving I/O: System Boot, Linux make, pdflatex, emacs, and BDB. System Boot includes the time from when the machine is powered up until the Linux boot is finished; Linux make compiles and links the Linux kernel and its modules; pdflatex compiles this paper with pdflatex; emacs runs a verilog-mode macro, which creates a module skeleton, performs auto-completion, and runs a syntax check. Once the verilog macro is finished, we conduct a conversation with the ELIZA mode for natural language processing. The BDB test involves a database with 1000K random fixed-size records, each containing 32-bit keys and 256-bit data fields. Pages are 32KB, the bulk transfer buffer is set to 4MB, the log buffer is also 4MB, and the buffer cache is 8MB. We perform 5K random queries and back-to-back update pairs.
For all the applications in SPEC, the execution time is limited to 90 seconds, which is long enough to capture thermal transitions and far longer than most architectural simulations.
Evaluation
The evaluation is divided into 6 sections. Section 5.1 starts by characterizing SPEC applications and showing all the thermal metrics. Section 5.2 continues by analyzing the impact of profile-based sampling on thermal simulations. Section 5.3 shows how to estimate the minimum simulation time. After showing the impact of other issues in Section 5.4 and discussing the impact of IO in Section 5.5, Section 5.6 concludes by enumerating a list of thermal modeling recommendations. Table 3 presents the performance, reliability, and energy thermal metrics for our target processor executing SPEC00 and SPEC06.
SPEC Characterization
MaxIPC as well as metrics capturing the maximum temperature and thermal gradient (MaxT and GradT, respectively) only reveal a small portion of the thermally related issues that may be of interest. To complement these metrics, we provide temperature-dependent reliability (EM, SM, TDDB, TC, NBTI) and power (Leak) metrics. The reliability metrics in Table 3 are normalized to a Mean Time Between Failures (MTBF) of 57.08 years; in the same fashion, leakage numbers are normalized to 1.
From our experimental data, we observe some interesting thermal aspects triggered by SPEC execution. First, there is a lack of correlation between average IPC and temperature. The correlation between IPC and MaxT is 0.36 and between IPC and GradT is 0.27. For example, the CFP00 suite has an average IPC of 0.81, which generates an average MaxT of 76 °C; however, CINT00 displays a lower average IPC of 0.69 while generating a higher average MaxT of 80 °C. A similar lack of correlation can be observed on a per-application basis. For example, sixtrack has the highest average IPC of 1.59 (with the second highest MaxIPC of 1.6), reaching a MaxT and GradT of 78 °C and 41 °C respectively, whereas mcf, with an IPC 77% lower, reaches a 4% higher maximum temperature. The reason for this lack of correlation is that MaxT and GradT report the maximum temperature or the maximum temperature difference. These values are not strongly correlated with average IPC. MaxIPC displays better correlation with temperature, but the correlation is still low because short IPC spikes are not long enough to increase MaxT.
Better correlation can be found between temperature and the EM reliability metric. For example, specrand has both the lowest (MaxT, GradT) tuple, with (55 °C, 17 °C), and the highest EM, with 232.41 years. At the same time, sjeng has the highest (MaxT, GradT) tuple, with (103 °C, 77 °C), and the lowest EM, with 22.35 years. However, SM does not follow a similar trend since it is more dependent on spatial concerns, such as the temperature distribution across the die.
By analyzing the correlations in Table 3 , we can observe that all three categories of thermal metrics have a low correlation. As a result, one metric cannot be approximated from another.
From the transient temperature data captured with our experimental setup, we divide the thermal behavior of the SPEC suite into two categories: thermally predictable and thermally variable. For example, the SPEC06 benchmarks in the thermally predictable category (Figure 6) have a predictable plateau after the initial warm-up. In contrast, those in the thermally variable category (Figure 7) have oscillations of over 5 °C once the warm-up is over. The majority of SPEC06 is fairly predictable; only a few benchmarks have significant performance oscillations. SPEC00, with a shorter execution time, has even shorter oscillations, and only gcc, swim, mesa, lucas, gap, galgel, facerec, and mcf could be classified as thermally variable. Of note, gcc and facerec show oscillations of over 10 °C. The temperature of each trace increases steadily over time. Figures 6 and 7 show only 30 seconds of the execution, while Table 3 reports the results for 90-second-long executions. Thus the maximum temperature reported for each application in Table 3 might be higher than what is shown in these figures.
Table 3. Performance and Thermal metrics for SPEC00, SPEC06, and IO Applications. (Columns include Apps, IPC, MaxIPC, and MaxT.)
Profile-based Simulation Impact
Figure 8 shows the overlap between simulation points and the measured performance trace for the SPEC06 gcc benchmark. Each simulation point lasts only 100M instructions, which corresponds to approximately 100 ms. We observe two key problems with simulation points. First, the ending temperature of each simulation point can be very different from the starting temperature of the following simulation point. For example, while simulation point 1 finishes in the high 60s °C, the following simulation point 2 starts in the high 70s °C. This leads to inaccurate simulations. Second, simulation points are too short to capture significant transients. Figure 9 shows a histogram of the maximum thermal gradient observed in each simulation point for all the SPEC applications. The majority of the simulation points have a thermal gradient under 2 °C. Some simulation points have over 4 °C because they are located in the warm-up phase, where the thermal gradient is higher.
Figure 10 shows the impact of the simulation points on IPC and different thermal metrics. There are five comparison points. Full corresponds to the complete execution of the program. Oracle is a weighted simulation point with an oracle setting its correct initial temperature. True uses a weighted average and the ending temperature of the previous simulation point as the beginning temperature for the current point. Typical is the same as True without the weighted average. This is the most typical way to use simulation points with temperature simulations. Some papers use only the first simulation point to perform their simulation (First).
As expected, IPC is very close to all the methods except First. For these metrics, First is not representative. Typical and True have similar results with the exception of leakage power where True has better results. In both cases, there is a 20% error for MaxT . The most problematic metric is reliability where the results are as bad as First. For the thermally variable subset of applications, the difference is even worse than the reported average normalized values (for example 1.18 vs. 1.03 for Oracle).
Oracle shows that simulation has the potential to yield correct results if the temperature for each simulation point is set correctly. However, temperature carries significant state, making it necessary to thermally simulate several seconds before the simulation point starts in order to obtain the correct initial temperature.
Using simulation points with thermal simulations introduces potential issues when evaluating reliability metrics. The worst case occurs when only one simulation point is used; in that case, all the metrics but IPC have significant error. Simulation points have the potential to yield correct results if the initial temperature is appropriately set for each simulation point. [3] proposes reusing the power consumption calculated in similar phases to approximate the starting temperature of the simulation point. We cannot verify the accuracy of this system with our setup.
Simulation Time
A common question among architects is "how long should I simulate?" Section 5.2 shows that profile-based simulation has several issues regarding thermal modeling. The answer is not simple because it depends on the package, the application, and the metrics being tracked.
Package Impact
To understand the minimum time required to simulate due to package constraints, we use the thermal time constant as defined in Section 2.1.4. If we ignore the material property changes due to temperature, the package thermal time constant is independent of the power trace.
The package acts as a low-pass filter, attenuating those power cycles with frequencies higher than the cutoff frequency of the filter. Since the thermal constant operates in an RC fashion, we define the cutoff frequency as $1/(2\pi\tau)$. Therefore, we consider $2\pi\tau$ as the minimum simulation time using the package as a constraint. Shorter simulation intervals are not long enough to pass the filter without the temperature being attenuated. A 50 ms thermal time constant (τ) implies that we need simulation periods of over 300 ms. In Figure 5, we observe a thermal time constant (τ) of 54 ms for our sapphire cooling solution.
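The low-pass behavior described above can be sketched numerically. The code below assumes an ideal single-pole RC filter, which is a simplification (a real package has several thermal time constants), and the function names are illustrative. It evaluates how strongly sinusoidal power cycles of different periods are attenuated for τ = 54 ms.

```python
import math


def cutoff_frequency_hz(tau_s):
    """Cutoff frequency 1 / (2*pi*tau) of a single-pole RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * tau_s)


def rc_gain(period_s, tau_s):
    """Amplitude gain of an ideal single-pole RC filter for a sinusoid of the given period."""
    omega = 2.0 * math.pi / period_s
    return 1.0 / math.sqrt(1.0 + (omega * tau_s) ** 2)


tau = 0.054  # s, the thermal time constant quoted in the text
min_period = 1.0 / cutoff_frequency_hz(tau)  # equals 2*pi*tau
for period in (0.010, 0.100, min_period, 1.0):
    print(f"period {period * 1e3:7.1f} ms -> gain {rc_gain(period, tau):.3f}")
```

Under this model, a power cycle whose period equals 2πτ passes with a gain of 1/√2, the usual −3 dB point, and shorter cycles are attenuated progressively more.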
Application Impact
The package for the processor provides a minimum simulation time required to obtain a meaningful thermal simulation (2πτ or 300 ms for the package analyzed in this paper). Nevertheless, applications may have longer phases leading to more moderate changes in power density. This section provides more insight into the minimum time for thermal simulations from the standpoint of the applications being analyzed.
Rangan et al. [23] define Interval Variability as the difference between the IPC of one interval and the IPC of the previous interval. The authors observe that 300-cycle intervals have an order of magnitude more interval variability than 10,000-cycle intervals. Both ranges are still much shorter than the 300 ms cutoff period for the package in this study. Figure 11 shows the average Interval Variability for intervals between 5 ms and 0.5 s. It shows that most of the variability happens in time intervals under 150 ms. With a thermal cutoff period of 300 ms, all the changes happening faster are attenuated. This is one of the reasons why so many SPEC applications fall in the thermally predictable category. Figure 12 shows the maximum Interval Variability. The SPEC applications have over 0.3 (or 30%) activity change for all time intervals. Therefore, thermal metrics may be affected by the potential changes in power activity. As a result, we need to look at the thermal metrics because the SPEC applications do not show clear points to constrain simulation time.
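As an illustration of the Interval Variability metric, the sketch below aggregates counter samples into intervals of different lengths and compares consecutive intervals. The normalization (absolute change relative to the previous interval) is our assumption and may differ from the exact definition in [23]; the function names and sample data are hypothetical.

```python
def ipc_per_interval(instructions, cycles, samples_per_interval):
    """Aggregate fine-grained counter samples into one IPC value per interval."""
    ipcs = []
    last_start = len(instructions) - samples_per_interval
    for start in range(0, last_start + 1, samples_per_interval):
        end = start + samples_per_interval
        ipcs.append(sum(instructions[start:end]) / sum(cycles[start:end]))
    return ipcs


def interval_variability(ipcs):
    """Relative absolute IPC change between consecutive intervals (assumed form)."""
    return [abs(cur - prev) / prev for prev, cur in zip(ipcs, ipcs[1:])]


# Hypothetical counter samples: IPC alternates between 1.0 and 0.5 every sample.
instructions = [100, 50] * 4
cycles = [100] * 8

fine = interval_variability(ipc_per_interval(instructions, cycles, 1))
coarse = interval_variability(ipc_per_interval(instructions, cycles, 2))
print(fine)
print(coarse)
```

In this toy trace the variability disappears once the interval spans a full fast phase change, which is the same averaging effect that lets longer intervals hide short-lived activity changes.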
Thermal Metric Impact
Figure 14. Measurement granularity error.
In addition to the package and application, the metric being observed also has an impact on the simulation time requirements. Figure 13 shows the error incurred when only a subset of the application is modeled, for different metrics. IPC is the metric that requires the least simulation time: a few seconds are enough to yield relatively accurate results. As shown in Section 5.2, simulation points can further improve the accuracy, yielding over an order of magnitude improvement for the same simulation time.
Performance and energy thermal metrics (MaxT, GradT, Leak) require longer simulation times: over 10 seconds to achieve an error of less than 20%. Reliability thermal metrics are the most difficult to capture; they require over 40 seconds to have less than 20% error. The previous plot selects a subsection of the application after warm-up. If the simulation instead starts from the beginning of the application, including warm-up, the same type of results still hold.
Since thermal metrics depend on the application behavior, it is more desirable to consider the number of instructions than the overall execution time. For a 1.7 GHz processor with an average IPC of 0.73, we need approximately 20 billion instructions for performance and power thermal metrics, and close to 100 billion instructions for reliability thermal metrics. For IPC estimation, we need only around 2 billion instructions after initialization, which is a fairly common simulation length.
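The conversion between simulated time and instruction count is simple arithmetic, sketched below with the clock frequency and average IPC quoted above. The function names are illustrative, and the conversion assumes that the average IPC holds over the simulated window.

```python
def instructions_for(seconds, freq_hz=1.7e9, ipc=0.73):
    """Instructions retired in the given time at a fixed frequency and average IPC."""
    return seconds * freq_hz * ipc


def seconds_for(instructions, freq_hz=1.7e9, ipc=0.73):
    """Execution time needed to retire the given number of instructions."""
    return instructions / (freq_hz * ipc)


for budget in (2e9, 20e9, 100e9):
    print(f"{budget / 1e9:5.0f} billion instructions -> {seconds_for(budget):5.1f} s")
```

At these values the processor retires roughly 1.24 billion instructions per second, so instruction budgets and time budgets can be exchanged directly once the average IPC of the workload is known.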
Other Thermal Issues
Spatial Resolution Impact
Up to this point, our evaluation has used a spatial resolution of 50×50 μm. We now evaluate the impact of averaging the temperature inside each of the floorplan blocks. Figure 14 compares the different thermal metrics between fine grain and coarse grain. MaxT and Leak do not have significant differences, but most reliability metrics are completely displaced. The problem lies in the exponential dependence of reliability metrics on temperature, which makes them very sensitive to changes in the temperature distribution within a given area. Figure 15 shows the difference in MaxT and GradT when using fine grain vs. coarse grain thermal measurements. Interestingly, there seems to be a constant offset between them. This opens the opportunity for further optimizations. We observe that the spatial resolution has a big impact on many metrics because they are not linear with temperature. Nevertheless, more work is needed to quantify the correct granularity and to determine how to manage the reliability metrics, because they are the most sensitive to spatial resolution.
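The sensitivity of exponential metrics to averaging can be illustrated with a small sketch. The reliability models are not restated in this section, so the code substitutes a generic Arrhenius-style rate, exp(−Ea/kT), purely as an assumption; the activation energy and the cell temperatures are hypothetical values chosen for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_rate(temp_c, activation_energy_ev=0.7):
    """Generic Arrhenius-style rate; the 0.7 eV activation energy is a placeholder."""
    temp_k = temp_c + 273.15
    return math.exp(-activation_energy_ev / (BOLTZMANN_EV_PER_K * temp_k))


# Hypothetical fine-grain cell temperatures (degC) inside one floorplan block,
# with a single hot spot.
fine_grid_c = [60.0, 62.0, 65.0, 90.0]

mean_temp_c = sum(fine_grid_c) / len(fine_grid_c)
rate_fine = sum(arrhenius_rate(t) for t in fine_grid_c) / len(fine_grid_c)
rate_coarse = arrhenius_rate(mean_temp_c)

print(f"block average temperature: {mean_temp_c:.2f} degC")
print(f"fine-grain / coarse-grain rate ratio: {rate_fine / rate_coarse:.2f}")
```

Because the rate is convex in temperature over this range, evaluating it at the block average underestimates the average of the per-cell rates, whereas a linear metric would be unaffected by the averaging.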
Heatsink and Power Density Impact
Finally, we consider the impact of the cooling solution and power density on thermal modeling and simulation. While power density affects the maximum temperature, the heatsink affects both the maximum temperature and the thermal time constant. Figure 16 shows the impact of changing the power density. Again, we power cycle block P with a 1 Hz frequency (150 ms active pulse and 750 ms inactive pulse), under different power densities: 56
Power Density
The power density has no impact on the thermal time constant, but it changes the max temperature that the system can reach. Power density affects the max temperature, or $T_1$ in the $(T_1 - T_0)(1 - e^{-1})$ term of the thermal time constant equation. Higher power densities may require a faster response to avoid thermal emergencies, because MaxT is reached faster. However, this does not change the simulation time required to characterize the application.
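A first-order step response makes this distinction concrete. The sketch below assumes a single-pole RC model with hypothetical starting, steady-state, and threshold temperatures; only the 54 ms time constant comes from our measurements. It shows that the fraction of the step covered at t = τ is the same for every power density, while a fixed temperature limit is reached sooner when the steady-state temperature is higher.

```python
import math


def step_response_c(t_s, t0_c, t1_c, tau_s):
    """Temperature of a first-order RC system t_s seconds after a power step."""
    return t0_c + (t1_c - t0_c) * (1.0 - math.exp(-t_s / tau_s))


def time_to_threshold_s(threshold_c, t0_c, t1_c, tau_s):
    """Time needed to reach threshold_c, assuming t0_c < threshold_c < t1_c."""
    fraction = (threshold_c - t0_c) / (t1_c - t0_c)
    return -tau_s * math.log(1.0 - fraction)


tau = 0.054    # s, thermal time constant quoted in the text
t0 = 45.0      # degC, hypothetical starting temperature
limit = 75.0   # degC, hypothetical thermal-emergency threshold
for t1 in (80.0, 100.0):  # hypothetical steady-state temperatures
    at_tau = step_response_c(tau, t0, t1, tau)
    covered = (at_tau - t0) / (t1 - t0)
    reach_ms = time_to_threshold_s(limit, t0, t1, tau) * 1e3
    print(f"T1={t1:5.1f} degC: {covered:.3f} of the step at t=tau, "
          f"limit reached after {reach_ms:.1f} ms")
```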
Heatsink
The thermal time constant is very dependent on the heatsink resistance and capacitance characteristics. Figure 17 shows the impact of the quality of the cooling solution. Block P is power-cycled at 1 Hz with two different heatsinks: the default AMD Mobile heatsink (Small) and a massive Cooler Master V8 (Big). Each heatsink is evaluated with two TIMs: TIM1 uses OmegaTherm 201 thermal paste, and TIM2 uses a liquid oil-based TIM. In addition, we use a 2 mm thick silver/diamond heat spreader with the big heatsink and TIM2. The best cooling solution with a large heatsink has a 5 ms thermal time constant. The thermal response and the max and min temperatures are very sensitive not only to the heatsink but also to the TIM solution. Both heatsinks are fairly close for TIM1, but as the TIM material improves, the big heatsink shows a significant cooling improvement. To push the limits of the air cooling solution, we use the diamond solution, which improves the thermal characteristics by reducing the max and min temperatures.
From Figure 17, we note that the cooling solution is a key parameter for temperature modeling. Surprisingly, it is disregarded in several papers. This is a major concern because most papers neither specify their cooling approach nor adapt their cooling solution to the new power envelope. This is especially important with thermal throttling, because different cooling solutions could cause the system to slow down or fail to throttle properly.
Another important observation is that different cooling solutions have different thermal time constants, and therefore different cutoff frequencies, which affect the required simulation time. In our lab, we have measured thermal time constants between 5 and 300 ms. The conclusion is that low power cooling solutions require longer simulation times than aggressive ones. As Section 5.3 shows, the simulation time should be the maximum of the package, application, and metrics time requirements.
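Taking the maximum of the three requirements can be written down directly. The sketch below is illustrative: the function names are ours, the metric times are the lower bounds reported in Section 5.3 (over 10 seconds and over 40 seconds), and the application requirement is left at zero as a placeholder because the SPEC applications do not provide a clear constraint.

```python
import math


def package_time_s(tau_s):
    """Minimum simulation time imposed by the package: 2*pi*tau."""
    return 2.0 * math.pi * tau_s


def required_simulation_time_s(tau_s, application_time_s, metric_time_s):
    """Simulation time must satisfy the package, application, and metric constraints."""
    return max(package_time_s(tau_s), application_time_s, metric_time_s)


# Lower bounds on metric time taken from the text; application time is a placeholder.
metric_time_s = {"MaxT/GradT/Leak": 10.0, "reliability": 40.0}
application_time_s = 0.0

for tau_ms in (5, 54, 300):
    tau_s = tau_ms / 1000.0
    print(f"tau = {tau_ms:3d} ms -> package minimum {package_time_s(tau_s) * 1e3:6.0f} ms")
    for metric, needed in metric_time_s.items():
        total = required_simulation_time_s(tau_s, application_time_s, needed)
        print(f"    {metric}: simulate at least {total:.1f} s")
```

For the time constants measured here, the metric requirement dominates; the package term becomes the binding constraint only when short simulations are used or when the cooling solution is much slower.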
IO vs SPEC Workload
As the simulation time section has shown, fast power consumption changes are filtered out. Therefore, applications may not show as many thermal oscillations as IPC oscillations.
Applications are thermally variable when they have power consumption changes that last on the same order as the thermal time constant. The more abrupt the power consumption change, the more significant the thermal oscillation. This means that many common applications involving large amounts of IO, interrupts, or varied distributions of IPC have a high chance of belonging to the thermally variable category. Figure 18 shows five different applications concatenated one after another. Unlike SPEC, these applications perform IO. We start by booting the system, which includes the BIOS and Linux kernel boot. After power up, the BIOS creates a fast spike reaching 65 °C before the Linux kernel starts to boot. The Linux kernel boot has significant thermal oscillations due to the IO activity periods, which the CPU leverages to reduce power consumption.
The second application is Emacs. The first spike of 15 °C is due to loading Emacs, performing a Verilog macro, and loading ELIZA. Once we start to "converse" with ELIZA, the power consumption of the system decreases significantly. We observed that this is typical of many applications, which have a thermal spike while loading but require little computation afterward, leaving the CPU nearly idle. Although it is not visible at the temperature scale of the figure, each keystroke generates a thermal spike of around 1 °C. The third application is pdflatex. It behaves similarly to Emacs, with two thermal spikes followed by a cool down. The fourth application involves a BDB database; it is very IO intensive and shows constant changes in temperature. The last application just powers off the system. Again, the same initial spike followed by a cool down is observed. Figure 19 shows the Linux kernel compilation (Linux Make). The first 40 seconds correspond to different dependency resolutions and compilations with significant IO. The rightmost plateau, reaching around 80 °C, corresponds to the linking of the kernel modules.
With the exception of Linux boot, the previous applications have so much IO that they are not able to reach high temperatures. As shown in Table 3, their maximum temperature is on average around 19 °C lower than that of SPEC. In fact, the workloads with the lowest temperatures fall under this category of applications. These applications are thermally variable, with a low chance of triggering a thermal emergency. As shown in Figures 6 and 7, many of the SPEC benchmarks show predictable thermal behavior, while IO workloads mostly show variable thermal behavior. A common characteristic of many IO applications is a thermal spike followed by a cooling down phase. However, the behavior of interactive workloads is closely affected by the way a user interacts with the application. Hence, the thermal behavior of this type of workload is not easy to predict.
Another characteristic is that the TC reliability metric is worse than that of the average SPEC application for Linux Make and System Boot. Also, Linux Make has worse SM reliability than any other application evaluated. While the MaxT and GradT for Linux Make are close to those of SPEC06 gcc, the IO activity makes Linux Make much worse than gcc for many reliability metrics. As Table 3 shows, the lower average temperatures of IO workloads compared to SPEC applications imply lower leakage consumption.
Thermal Modeling Recommendations
The three commonly used temperature estimation approaches (thermal and power simulation, hybrid thermal simulation-measurement, and direct measurement) have their own strengths and weaknesses. Our work provides the following recommendations for improvement that can be applied to all three approaches:
• Profile-based sampling is effective for performance and power phase detection, but it potentially induces an elevated degree of error for thermal modeling, especially in the conventional way it is used in the thermal simulation studies. A way to mitigate the error would be to reuse the power consumption calculated in similar phases as proposed in [3] .
• The minimum simulation time depends on the package, the application, and the metrics being tracked. Simulation time should be selected to guarantee that the following three constraints are satisfied.
The package provides an absolute minimum simulation time assuming a worst case application. For the systems analyzed in this paper, the thermal time constant is 54 ms. Other packages measured have thermal time constants between 5 and 300 ms. As an example using the cutoff frequency from our measured systems, we recommend a minimum simulation time of 31 ms for the most aggressive cooling solutions and 2000 ms for more conservative ones. Embedded systems with even simpler cooling solutions may require longer simulation times.
The SPEC suite does not show a decay of interval variability. Therefore, the maximum of the thermal metric and package simulation time constraints should be used to select the thermal simulation time.
Each thermal metric has a different simulation time constraint. For performance (MaxT, GradT) and power (Leak) metrics, 20 billion instructions after initialization keep the error under 20%. For reliability metrics, close to 100 billion instructions are required.
• Most papers using HotSpot for thermal simulation do not modify the heatsink. The heatsink should be adapted to the chip power consumption. Any thermal paper should specify the capabilities of its cooling solution and adjust them to its power demands. Ideally, papers should also report the percentage of time that the system has thermal emergencies (going over a given max temperature) before and after their optimizations.
• Although SPEC is widely accepted as a representative benchmark for performance and power, we think that more work is required to determine a representative thermal suite. Since I/O intensive applications have a very different thermal behavior, we think that researchers should try to include I/O benchmarks in their workload. This is hinted at by the observation that relatively similar applications like gcc and Linux Make show very different reliability metrics.
• Workload selection is always important. This is also true for thermal evaluations because most SPEC applications behave like a simple heat phase followed by a flat plateau. Many benchmarks show under 2 °C oscillation after warm-up. This may not be so important for papers that only care about temperature to estimate leakage power. Nevertheless, reliability, performance, or thermal modeling papers should try to incorporate IO apps, and dealII, gcc, mcf, milc, and sjeng from SPEC06.
• The commonly used functional unit granularity does not seem to be enough. The measurements show thermal discrepancies consistent with some power density reports [4]. Current implementations of the three approaches do not provide such intra functional unit resolution. As the technology scales, this problem becomes less important because a single processor becomes a small part of the whole die. Nevertheless, it seems that further research is required to create models with finer spatial resolution for large structures like caches.
Related Work
Increasing power densities have led to a need to further study spatial and temporal requirements for proper power and thermal modeling. [8] investigates the relationship between core size and on-chip hot spot temperature, considering the spatial low-pass filtering behavior of temperature. [13, 17, 18] address the optimum locations and allocations for thermal sensors in reconfigurable and multiprocessor systems. In [16, 20, 28], the authors investigate the effect of temperature-aware floorplanning and how the processor can exploit it for improved thermal management. In all these studies, the temperature is estimated using modeling and thermal simulations.
A majority of the studies in microarchitecture use the SPEC benchmark suite to evaluate their proposed methods and design proposals. Nevertheless, no thermal characterization of the benchmarks themselves is available to the community (see [21, 22] for studies addressing benchmark evaluation with objectives other than temperature). We believe that this work is the first to characterize the thermal behavior of the SPEC benchmarks using accurate thermal measurement.
Several studies investigate the phase behavior of applications for characterization and optimization purposes [3, 5, 10, 24, 25]. Many simulation-based studies addressing thermal issues rely on these studies to skip the simulation of redundant execution phases in the application. Except for [3], which reuses power traces, they assume that performance phases match thermal phases. Nonetheless, no study verifies this assumption. This work evaluates it by measuring the error introduced by this assumption through live experiments.
Conclusions
This paper measures real systems and provides many insights and recommendations to the computer architecture community working with thermal modeling. The main contribution of the paper is to show that the commonly used thermal simulation methodology based on statistical sampling has significant problems with key thermal metrics. While Simpoint has less than 1% IPC error, it has a large error for key reliability metrics.
To better understand why the conventional thermal simulation methodology based on statistical sampling does not work with temperature, we provide measurements to estimate the impact of simulation time. We show that the thermal time constant is over 50 ms for a modern package; as a result, without significant changes in power density, thermal simulations of a few milliseconds do not make much sense. Simulators need to perform long simulations to capture the thermal phases present in programs. Papers that perform sub-second simulations with a realistic heatsink configuration do not leave time for significant thermal transients to develop.
Leveraging the measurements, we propose a list of recommendations for researchers performing thermal simulations. The evaluation also thermally characterizes SPEC00 and SPEC06 for the first time. We classify applications as thermally predictable or thermally variable. In addition, we show that IO applications, which are not represented in the SPEC benchmarks, show more variable thermal behavior.
