Abstract-Fused Multiply-Add (FMA) units are widely used in the floating-point execution units of state-of-the-art multicore processors. It has been shown that, for division operations, digit-recurrence units consume much less power and energy than FMA units, which rely on Newton-Raphson approximation algorithms. In this work, we show that digit-recurrence division units can also reduce on-chip thermal coupling from hot blocks (e.g. FMAs) to cool blocks such as caches. By placing power-efficient dividers between FMAs and a cache block, we lower the average temperature by 5
I. INTRODUCTION
As CMOS technology continues to scale down, the transistor threshold voltage has been decreased in order to maintain switching speed and thus performance. The decrease in threshold voltage has resulted in an exponential increase in subthreshold leakage current. Due to the increased leakage current and the large number of idle devices, leakage power has become comparable to dynamic power in many applications. The International Technology Roadmap for Semiconductors (ITRS) [1] predicted that leakage power would contribute more than 50% of total power in the next generation of processors.
The increase in leakage power is even more significant in caches, where only a few cache lines are accessed every clock cycle. The low switching activity in caches makes leakage the dominant fraction of total power consumption. It is shown in [2] that in a 70 nm process, leakage amounts to more than 70% of the power consumed in caches. As on-chip caches occupy increasing die area, controlling leakage in caches is becoming more and more important.
Subthreshold leakage, which is the major component of leakage, is an exponential function of junction temperature. It has been shown that at high temperatures, leakage power can be orders of magnitude larger than at low temperatures. Due to the super-linear dependency of leakage power on temperature, many methods have been proposed, both at run time and at design time, to improve a circuit's thermal profile in order to contain leakage. In [3], [4], [5], the authors proposed dynamic thermal management methods to monitor and limit the peak temperature in a processor at run time. In [6], [7], the authors developed thermal-aware floorplanning algorithms to lower the peak temperature by reducing the thermal coupling between power-hungry modules. In [8], a floorplanning algorithm is proposed with the explicit objective of reducing leakage.
In this work, we first briefly analyze the impact on power dissipation of the Floating-Point (FP) units that are typically implemented in multicores and GPUs. The unit considered for FP addition and multiplication is a Fused Multiply-Add (FMA) unit, which is a popular choice in multicores [9] and GPUs [10]. For FP division we considered the multiplicative approach implemented by the FMA unit, and the digit-recurrence approach. The digit-recurrence approach is shown to be more power and energy efficient [11].
The low-power dividers can also be used to act as heat spreaders for the FMAs. In this way, we can limit the amount of heat diffusion from the FMAs to leakage-dominant blocks such as on-chip caches. For example, in Nvidia's latest CUDA architecture, Fermi [10], each streaming multiprocessor contains 32 cores and 64 KB of L1 cache, with one FMA unit in each core. The excessive heat generated in the FMAs can cause a temperature rise in the L1 cache and induce high leakage. By placing the low-power dividers between the FMAs and the cache, we can reduce the average temperature in the cache. The leakage power, which dominates power consumption in caches, can be significantly reduced by decreasing their temperature. To study the amount of leakage reduction that can be obtained from lowering the temperature, we developed a temperature-dependent leakage model by characterizing a 65 nm standard cell library.
II. POWER DISSIPATION IN FLOATING-POINT UNITS
In this section we briefly summarize the work of [11] where power and energy dissipation in floating point units are analyzed.
A. Fused Multiply-Add Unit
A Fused Multiply-Add (FMA) unit implements the floating-point operation A + B × C.
The FMA unit can also be utilized to implement division based on the Newton-Raphson approximation [12]. The quotient q of the division can be computed by multiplying the reciprocal of the divisor d by the dividend x:

$$q = \frac{x}{d} = \frac{1}{d} \cdot x$$

This is implemented by the approximation of the reciprocal R = 1/d, followed by the multiplication q = R · x. By determining R[0] as the first approximation of 1/d, R can be approximated in m steps by the Newton-Raphson iteration

$$R[j+1] = R[j] \, (2 - R[j] \, d), \qquad j = 0, 1, \ldots, m-1$$
Each iteration requires two multiplications and one subtraction. To have rounding compliant with the IEEE standard, extra iterations are required to compute the remainder and perform the rounding according to the specified mode [12]. The division operation can be implemented either by using the existing FMA instruction (with a software implementation of the NR algorithm), or by augmenting the FMA unit with an initial approximation table and letting intermediate results bypass unnecessary pipeline stages and the register file. We designed an FMA unit with four pipeline stages and an FMA variant with built-in hardware support for division (a lookup table provides an 8-bit initial approximation of the reciprocal).
The number of clock cycles required for the hardware implementation of the binary64 division algorithm on the augmented FMA, including remainder calculation and rounding, is 18.
On the other hand, for the software implementation running on a plain FMA, a binary64 division requires 41 cycles.
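The Newton-Raphson scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it operates on Python floats rather than on an FMA datapath, the helper names `initial_reciprocal` and `nr_divide` are hypothetical, the 8-bit table is emulated by indexing on the truncated divisor, the iteration used is the standard Newton-Raphson reciprocal step, and the remainder computation and IEEE-compliant rounding are omitted.

```python
def initial_reciprocal(d, bits=8):
    """Emulate a lookup table indexed by the top `bits` fractional bits of d.

    Assumption: the table stores the reciprocal of the interval midpoint,
    rounded to `bits` fractional bits. The real table contents are not
    given in the text.
    """
    scale = 1 << bits
    d_mid = (int(d * scale) + 0.5) / scale   # midpoint of the table interval
    return round(scale / d_mid) / scale      # entry with `bits` fractional bits


def nr_divide(x, d, steps=3):
    """Approximate x / d for a normalized divisor 1 <= d < 2."""
    assert 1.0 <= d < 2.0, "sketch assumes a normalized significand"
    r = initial_reciprocal(d)
    for _ in range(steps):
        # Standard NR reciprocal step: two multiplications, one subtraction.
        r = r * (2.0 - r * d)
    return x * r                             # final multiplication q = R * x


if __name__ == "__main__":
    print(nr_divide(1.5, 1.25))
```

Because the relative error of the reciprocal roughly squares at every step, an initial approximation that is accurate to about 8 bits needs three steps to pass the 53-bit precision of binary64, which is why `steps=3` is used as the default here.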
B. Division by digit-recurrence
The digit-recurrence algorithm [13] is a direct method to compute the quotient of the division. The radix-r digit-recurrence division algorithm for binary64 (double-precision) significands is implemented by the residual recurrence

$$w[j+1] = r \, w[j] - q_{j+1} \, d$$

with the initial value w[0] = x and the quotient digit $q_{j+1}$, normally in signed-digit format, which provides $\log_2 r$ bits of the quotient at each iteration. The quotient-digit selection is

$$q_{j+1} = SEL(d_\delta, y)$$

where $d_\delta$ is d truncated after the δ-th fractional bit and the estimated residual, $y = \widehat{rw[j]}_t$, is $rw[j]$ truncated after t fractional bits. To simplify the $q_{j+1}$ selection, a signed-digit redundant representation is chosen. Both δ and t depend on the radix and the redundancy (a). For a radix-16 division algorithm, the total number of iterations to compute a binary64 division, including initialization and rounding, is 18.
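The residual recurrence can be illustrated with a short Python sketch. It is not the hardware algorithm: the selection function on truncated estimates is replaced by exact rounding of the shifted residual divided by d, exact rational arithmetic stands in for the redundant residual representation, the residual is initialized to x/4 instead of x (an assumption made here so that every digit stays within ±8), and 14 iterations are used to produce 56 quotient bits. The function name `digit_recurrence_divide` is hypothetical.

```python
from fractions import Fraction


def digit_recurrence_divide(x, d, r=16, iterations=14):
    """Radix-r digit-recurrence division of significands 1 <= x, d < 2.

    Returns (quotient, digits, final_residual) as exact rationals.
    """
    x, d = Fraction(x), Fraction(d)
    assert 1 <= x < 2 and 1 <= d < 2, "sketch assumes normalized significands"
    w = x / 4                       # assumption: pre-scaled initial residual
    q = Fraction(0)
    digits = []
    for j in range(1, iterations + 1):
        shifted = r * w
        # Exact selection by rounding; hardware would use SEL on estimates.
        digit = round(shifted / d)
        assert abs(digit) <= r // 2  # signed digit stays in [-8, 8] for r = 16
        w = shifted - digit * d      # w[j+1] = r*w[j] - q_{j+1}*d
        q += Fraction(digit, r ** j)
        digits.append(digit)
    return 4 * q, digits, w          # undo the x/4 pre-scaling


if __name__ == "__main__":
    quotient, digits, residual = digit_recurrence_divide(1.5, 1.25)
    print(float(quotient), digits)
```

Each iteration retires one signed radix-16 digit, i.e. 4 bits of the quotient, and the final residual bounds the remaining error, which is what the rounding step of a real unit would use.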
C. Power and Energy Consumptions in FP Operations
To analyze the impact on power dissipation of the different units and to evaluate the different approaches to division, we implemented two units for binary64: FMA and r16div. The FMA is a fused multiply-add unit modified to execute the NR division algorithm. r16div is a radix-16 divide unit complete with a convert-and-round unit and with sign and exponent computation and update.
The units are synthesized with Synopsys Design Compiler using a 65 nm standard cell library and are laid out with Synopsys IC Compiler. Power is estimated with Synopsys Power Compiler based on randomly generated input vectors conforming to the IEEE 754 binary64 format. The synthesis results are summarized in Table I. The power dissipation data for the FMA are divided by operation.
The data obtained for the software implementation of the NR iterations using FMA instructions on an unmodified FMA unit, FMA soft, are reported in the fifth row of Table I.
TABLE I RESULTS OF IMPLEMENTATIONS.
$P_{ave}$ is the average power measured at 1 GHz. * Iterative algorithm, pipeline not full.
The power dissipation is normalized for all units at 1 GHz. For the three operations ADD, MUL and MA fused, the power was measured with the pipeline full to get the worst-case power dissipation necessary to characterize the thermal behavior (Section IV) of the units. For division, being an iterative algorithm, the power was measured per operation. This explains why the value of $P_{ave}$ for FMA DIV is smaller than for the other FMA cases.
As for floating-point division, from the data of Table I it is clear that the digit-recurrence approach (r16div) is more convenient in terms of latency, area and power dissipation. The only argument in favor of the FMA DIV is that, because division is much less frequent than addition and multiplication, a larger power dissipation for the operation can be tolerated.
Since the software implementation of division has a much longer latency, in all the experimental results shown hereafter we refer to the modified FMA with hardware support for division.
In [14], the average frequency of floating-point operations in the SPECfp92 benchmark suite is reported. The most common instructions are multiply and add, with MULT accounting for 37% and ADD for 55%. Moreover, the FP adder consumes nearly 50% of the multiply results, which explains why fused multiply-add units are often used in modern processors.
The energy figure used for the comparison is the energy per operation ($E_{op}$).
The values of $E_{op}$ (at 1 GHz) for the FP units implemented are listed in the last column of Table I.
Due to the significant reduction in $E_{op}$ for division, we propose to use a digit-recurrence divider (e.g. r16div) for FP division. To compare the energy savings between division by FMA and by digit-recurrence, we use the SPICE benchmark, which has a rather high percentage of divisions [15].
We list the results of the comparison in Table III to show the upper and lower bounds of the energy consumption. In Table III, the results are obtained by assuming that none of the MULT instructions can be fused with the ADD (columns 2-4) and by assuming that all MULT instructions can be fused with the ADD (columns 5-7). The MULT and ADD operations are implemented by the FMA unit. The DIV operation is implemented either in r16div or in the FMA. In both cases (fused MA or not) there is a significant reduction (larger than 30%) in energy consumption when a digit-recurrence divider is used to implement binary64 division.
III. LEAKAGE AND THERMAL MODELS

A. Leakage Power Model
The leakage power is mainly due to gate tunneling and subthreshold leakage and has a large temperature dependency. According to ITRS, various low power techniques such as dynamic Vt, multiple Vt transistors, power domains/voltage islands, body bias, etc. will mitigate leakage until 2012. As the use of high-k dielectrics brings gate leakage under control, subthreshold leakage is going to dominate and limit performance [1] .
While gate leakage is relatively temperature independent, subthreshold leakage has an exponential dependence on temperature. The SPICE BSIM4 model is a very elaborate model of modern transistor behavior. We use the BSIM4 models provided by the standard cell library to characterize the temperature dependency of the cells' leakage power. Points in Figure 1 show the normalized leakage reported by SPICE as temperature is swept from 20 °C to 150 °C, which is a typical range of operating temperatures. In order to approximate the exponential increase of leakage, we use a fourth-order polynomial to accurately fit the SPICE-reported data. The model is used in the next section to calculate the amount of leakage reduction obtained from a decrease in temperature. The polynomial is plotted in Figure 1 as a curve. As shown in the figure, leakage power more than doubles for every 30 °C rise in temperature. At high temperatures the rate of increase in leakage power is very fast, which means that even a few degrees of increase in temperature can induce a large amount of leakage power.
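The sensitivity of leakage to temperature can be made concrete with a small Python sketch. The fitted fourth-order polynomial and the SPICE data are not reproduced in the text, so this sketch substitutes a simple stand-in model, stated here as an assumption: leakage exactly doubles for every 30 °C. Since the text says leakage more than doubles over that interval, the stand-in should understate the real sensitivity. The function names and the 20 °C reference point are hypothetical.

```python
def normalized_leakage(temp_c, ref_c=20.0, doubling_c=30.0):
    """Leakage relative to its value at ref_c.

    Assumption: leakage doubles every `doubling_c` degrees. This replaces
    the fitted fourth-order polynomial, whose coefficients are not given.
    """
    return 2.0 ** ((temp_c - ref_c) / doubling_c)


def leakage_reduction(t_before_c, t_after_c, doubling_c=30.0):
    """Fractional leakage saving when cooling from t_before_c to t_after_c."""
    before = normalized_leakage(t_before_c, doubling_c=doubling_c)
    after = normalized_leakage(t_after_c, doubling_c=doubling_c)
    return 1.0 - after / before


if __name__ == "__main__":
    # Cooling a block by 5 degrees under the stand-in doubling rule.
    print(f"{leakage_reduction(75.0, 70.0):.1%}")
```

Under this stand-in, the saving depends only on the temperature difference, not on the absolute temperature; a polynomial fit to measured data would not have that property.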
B. Thermal Model
Heat generated inside a chip is dissipated to the ambient air through the silicon substrate and the package. The excessive heat is mainly due to the power consumed in different units in the device layer. Therefore, accurate thermal modeling requires a 3-D analysis of the various components of a chip.
To perform thermal analysis, we use the model proposed in [16], which consists of a conventional RC model of the heat conduction paths around each thermal element. The differential equation modeling heat transfer according to Fourier's law is solved by first transforming it into a difference equation, and then using SPICE to solve the equivalent RC electrical network.
A circuit is meshed into three-dimensional thermal cells. The z direction is discretized into 9 layers. Cells inside the grid are connected to each other while cells on the boundary are connected to voltage sources that model the ambient temperature. Power consumption in cells is modeled by current sources. For steady-state analysis, the RC model reduces to a netlist of thermal resistances and current sources.
In our thermal model, we adopted the thermal conductivities of different layers from [17] .
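The steady-state analogy (temperatures as node voltages, power as current sources, ambient as a voltage source) can be sketched in Python. This is a deliberately reduced illustration, not the 3-D, 9-layer model solved with SPICE: it uses a 1-D chain of cells, a Gauss-Seidel relaxation instead of a circuit solver, and hypothetical conductances and per-cell powers in arbitrary units. All names and values below are assumptions.

```python
def steady_state_rise(power, g_lat=1.0, g_amb=0.2, sweeps=2000):
    """Temperature rise above ambient for a 1-D chain of thermal cells.

    Each cell has a conductance g_amb to ambient and g_lat to each
    neighbour; power[i] is the heat injected into cell i.
    """
    n = len(power)
    t = [0.0] * n
    for _ in range(sweeps):                      # Gauss-Seidel relaxation
        for i in range(n):
            neighbours = [t[k] for k in (i - 1, i + 1) if 0 <= k < n]
            g_total = g_amb + g_lat * len(neighbours)
            t[i] = (power[i] + g_lat * sum(neighbours)) / g_total
    return t


# Hypothetical per-cell powers (arbitrary units), not taken from the paper.
FMA, DIV, CACHE = 1.0, 0.1, 0.05


def mean_cache_rise(cells):
    """Average rise over the cells labelled 'cache' in a (name, power) list."""
    rise = steady_state_rise([p for _, p in cells])
    cache = [t for (name, _), t in zip(cells, rise) if name == "cache"]
    return sum(cache) / len(cache)


without_div = [("fma", FMA)] * 4 + [("cache", CACHE)] * 6
with_div = [("fma", FMA)] * 4 + [("div", DIV)] * 2 + [("cache", CACHE)] * 6

if __name__ == "__main__":
    print(mean_cache_rise(without_div), mean_cache_rise(with_div))
```

In this toy chain, low-power cells placed between the hot cells and the cache cells lower the mean cache temperature rise, because heat from the hot cells has more paths to ambient before it reaches the cache.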
IV. LEAKAGE OPTIMIZATION IN CACHES
In Section II, we showed that energy can be significantly reduced by using a digit-recurrence unit for division operations. The placement of these digit-recurrence units can be exploited to limit the amount of heat diffusion from hot blocks such as FMAs to caches. Leakage is the dominant fraction of power consumption in caches, so the power of a cache is more sensitive to temperature increases. In our experiments we assumed, based on [2], that 70% of the total power consumption in the cache is due to leakage. Three experiments are performed for different configurations of FMA blocks, r16div blocks and a cache block. The size of the cache block is 875 μm × 160 μm, which can accommodate approximately 16 KB in a 65 nm process according to our estimation. The area of an r16div, as reported in Table I, is roughly 1/8 of that of an FMA unit.
In Figure 2 we show the impact on temperature distribution when two r16div units are placed between the FMAs and a cache block. Temperatures shown in the figure indicate the rise above the ambient temperature, which is 50 °C in our experiments. Power consumption in the FMA and div units is estimated based on a workload characterized by the instruction mix with fused MA as shown in Table II. The right figure has more thermal cells in the grid due to its increased area.
The area occupied by the FMAs has a higher temperature, which is reflected by the dark color. The div units reduce the average temperature rise in the cache block from 22.8 °C to 20.1 °C. This means that the absolute temperature in the cache block is reduced from 72.8 °C to 70.1 °C. According to our temperature-dependent leakage model, this decrease results in a 6.77% reduction in leakage. The total power consumption in the cache block is therefore reduced by 4.74%.
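The step from leakage reduction to total cache power reduction follows from the assumed 70% leakage share. The short Python sketch below reproduces that arithmetic for the numbers quoted above; it assumes that the remaining 30% of cache power is unaffected by the temperature change, and the function names are hypothetical.

```python
AMBIENT_C = 50.0          # ambient temperature used in the experiments
LEAKAGE_SHARE = 0.70      # assumed fraction of cache power due to leakage


def absolute_temperature(rise_c, ambient_c=AMBIENT_C):
    """Absolute temperature from a rise above ambient."""
    return ambient_c + rise_c


def cache_power_reduction(leakage_reduction, leakage_share=LEAKAGE_SHARE):
    """Total cache power saving, assuming only the leakage part changes."""
    return leakage_share * leakage_reduction


if __name__ == "__main__":
    print(absolute_temperature(22.8), absolute_temperature(20.1))
    print(f"{cache_power_reduction(0.0677):.2%}")
```

With a 6.77% leakage reduction the sketch gives 0.70 × 6.77% ≈ 4.74%, matching the figure reported for the cache block.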
In Figure 3 we show a larger circuit composed of a cache block and four FMA units. The size of the cache block is the same as before. Again, we use separate dividers for division operations to save energy and reduce the average temperature in the cache. In test2, we first use two dividers, as shown in Figure 3 (left), where the average temperature in the cache block drops by 2.3 °C and leakage is reduced by 5.69%. Next, we use four dividers in test3, as shown in Figure 3 (right). Due to the larger area introduced by the dividers, the average temperature rise in the cache block is reduced by 5 °C, from 25.6 °C to 20.6 °C. The leakage is reduced by 12% and the total power consumption in the cache decreases by 8.44%. Table IV summarizes the experimental results, where $T_1$ is the average temperature rise above ambient in the cache when no dividers are used and $T_2$ is the average temperature rise in the cache after dividers are inserted. ΔT is the temperature difference between $T_1$ and $T_2$. $\Delta P_{leak}$ and $\Delta P_{cache}$ are the percentage reductions in leakage and in total power in the cache block due to the decrease in temperature.
In Table V we compare the energy consumption when division is implemented by the FMA and by digit-recurrence. This time the savings in energy come not only from division instructions but also from leakage reduction in the cache. Energy per operation in Table V is larger than in Table III because the operating temperature obtained from thermal analysis is taken into consideration when calculating leakage power for the FMA and r16div units. $P_{leak}$ in the cache is obtained by extrapolating $P_{leak}$ in the FMA, based on the assumption that leakage is proportional to area. The overall energy savings are quite close to those in Table III because, for the whole system, dynamic power still dominates.
By using more divider units we can obtain a larger reduction in cache temperature and thus in leakage. However, the cost is increased die area, which might not be desirable. It should be noted that the divider units are power-efficient components for division operations rather than plain empty space. Given the power and thermal properties of the divider block, designers should decide the number of units to use based on the frequency of division operations in the application.
V. CONCLUSION
In this paper, we show that although a Fused Multiply-Add (FMA) unit can also perform division operations, its power consumption is much larger than that of a digit-recurrence divider. Implementing division in a low-power divider can save a significant amount of energy even though division is a less frequent operation. The leakage power, being temperature sensitive, can also benefit from the use of power-efficient dividers. Due to the reduction of heat diffusion from hot FMAs, we show that the total power consumption in on-chip cache blocks can be reduced by as much as 8%.
